Bug 205610

Summary: Assertion failure in log_do_checkpoint
Product: Red Hat Enterprise Linux 4 Reporter: Stuart Hayes <stuart_hayes>
Component: kernelAssignee: Eric Sandeen <esandeen>
Status: CLOSED DUPLICATE QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.4CC: admin, jbaron, jfeeney, markscarbrough, plankers, rdejean, sct, tao, wwlinuxengineering
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-04-11 22:17:46 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
jbd patch none

Description Stuart Hayes 2006-09-07 16:38:30 UTC
Description of problem:

While running stress for 1-2 days on a Dell PE1900, system got the following 
error:

Assertion failure in log_do_checkpoint() at 
fs/jbd/checkpoint.c:363: "drop_count != 0 || cleanup_ret != 0"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at checkpoint:363
invalid operand: 0000 [1] SMP 
CPU 0 
Modules linked in: sg ipmi_devintf ipmi_si ipmi_msghandler mptctl mptbase 
dell_rbu md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core sunrpc ds 
yenta_socket pcmcia_core sr_mod usb_storage button battery ac joydev uhci_hcd 
ehci_hcd hw_random e1000 bnx2 floppy dm_snapshot dm_zero dm_mirror ext3 jbd 
dm_mod lpfc scsi_transport_fc megaraid_sas aic79xx(U) sd_mod scsi_mod
Pid: 27340, comm: sendmail Not tainted 2.6.9-42.ELsmp
RIP: 0010:[<ffffffffa00b3c80>] <ffffffffa00b3c80>{:jbd:log_do_checkpoint+988}
RSP: 0018:0000010001ec1af8  EFLAGS: 00010216
RAX: 000000000000006e RBX: 0000010075e62298 RCX: ffffffff803e1ee8
RDX: ffffffff803e1ee8 RSI: 0000000000000246 RDI: ffffffff803e1ee0
RBP: 0000010039be1978 R08: ffffffff803e1ee8 R09: 0000010075e62298
R10: 0000000100000000 R11: 0000ffff803fcc00 R12: 000001007ded9a00
R13: 0000010039be1978 R14: 0000000000000000 R15: 000001007da7cdc0
FS:  0000002a96e91660(0000) GS:ffffffff804e5080(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000552ad64a08 CR3: 0000000000101000 CR4: 00000000000006e0
Process sendmail (pid: 27340, threadinfo 0000010001ec0000, task 
000001006806e030)
Stack: 0013c6b000000000 0000010039be1978 0000010039be1978 0000000000000000 
       0000000000000007 0000000000000000 0000000000000000 ffffffff8017ae4c 
       000001007599fc48 0000000000000001 
Call Trace:<ffffffff8017ae4c>{__find_get_block_slow+255} <ffffffff8017b618>
{__find_get_block+396} 
       <ffffffff8017d914>{__getblk+42} <ffffffff801e9fc0>
{radix_tree_delete+325} 
       <ffffffff8015a298>{find_get_page+65} <ffffffff8015db1a>
{__pagevec_free+39} 
       <ffffffffa00b3d98>{:jbd:__log_wait_for_space+209} <ffffffffa00af397>
{:jbd:start_this_handle+919} 
       <ffffffff8015a475>{find_get_pages+88} <ffffffff80163faf>
{pagevec_lookup+23} 
       <ffffffff80164277>{truncate_inode_pages+449} <ffffffff8018aa09>
{fasync_helper+204} 
       <ffffffffa00af4c6>{:jbd:journal_start+223} <ffffffffa00c4358>
{:ext3:start_transaction+25} 
       <ffffffffa00c767e>{:ext3:ext3_delete_inode+0} <ffffffffa00c769a>
{:ext3:ext3_delete_inode+28} 
       <ffffffffa00c767e>{:ext3:ext3_delete_inode+0} <ffffffff8019242c>
{generic_delete_inode+190} 
       <ffffffff8018f1b0>{dput+463} <ffffffff8017a2cd>{__fput+253} 
       <ffffffff80178e44>{filp_close+103} <ffffffff80178ecd>{sys_close+130} 
       <ffffffff8011026a>{system_call+126} 

Code: 0f 0b b4 7d 0b a0 ff ff ff ff 6b 01 4d 39 7c 24 60 0f 84 45 
RIP <ffffffffa00b3c80>{:jbd:log_do_checkpoint+988} RSP <0000010001ec1af8>
 <0>Kernel panic - not syncing: Oops


Version-Release number of selected component (if applicable):
The dump shows that this was kernel version 2.6.9-42.ELsmp (RHEL4 U4).

How reproducible:
This has only been seen once so far.  We are in the process of trying to 
reproduce the issue.


Steps to Reproduce:
1. Install RHEL4 U4 x86_64 on Dell PE1900 with 4GB memory, a RAID1 (4 physical 
drives) on PERC5/i, a RAID5 (4 phys. drives) on PERC5/e channel 0, and a RAID5 
(7 phys. drives) on PERC5/e channel 1.
2. Run dovecot & sendmail stress.
3. Observe failure (see output above).

Actual results:
System failed with message above.

Expected results:
System should continue to run.


Additional info:
Note that there is an almost identical bug (BZ 162814), but that bug was 
supposedly fixed in RHEL4 U3, and this bug was found running RHEL4 U4.

Comment 1 Stuart Hayes 2006-09-08 16:16:27 UTC
Searching the mailing lists, I came up with this.  I have no idea if this will 
actually fix the bug we're seeing or not.  This patch apparently did go into 
the mainstream kernel, and is not in RHEL4 U4.

http://marc.theaimsgroup.com/?l=linux-kernel&m=111877954110234&w=2


Comment 2 Stuart Hayes 2006-09-20 16:41:40 UTC
The patch from comment #1 appears to not be relevant to RHEL4.  However, I 
have found this, which seems to indicate that there are other race conditions 
in this code that can trigger this bug, and this patch attempts to fix those.

http://marc.theaimsgroup.com/?l=linux-kernel&m=114659569713241&w=2


Comment 3 Eric Sandeen 2006-09-20 19:19:33 UTC
Stuart, what's the status on reproducing this one?

Comment 4 Stuart Hayes 2006-09-20 19:48:53 UTC
The same system ran for 120+ hours with no failure.  At that point, RHEL4 was 
reinstalled, and it ran for another 120+ hours with no failure.  So we're 
assuming this is a really rare race condition.


Comment 5 Jeff Layton 2006-10-03 15:16:52 UTC
Took a brief look at the patch linked above, it looks like it removes this
assertion altogether and so would be a large-hammer approach to avoiding it.

The downside is that the patch is *really* big, and needs to be carefully looked
at to make sure it doesn't break kabi. A reproducer is really needed here at
this point.



Comment 6 Eric Sandeen 2006-10-03 15:19:28 UTC
*** Bug 206393 has been marked as a duplicate of this bug. ***

Comment 11 Bob Plankers 2006-10-31 22:29:52 UTC
Hi folks,

I am seeing this as well, on a Dell PowerEdge 1850 running RHEL AS 4 kernel
version 2.6.9-42.0.3.ELsmp. This server is a mail server (sendmail), and under
particularly high I/O loads it panics with the same error (yes, the server is
named 'UP'):

Oct 31 15:26:33 up kernel: Assertion failure in log_do_checkpoint() at fs/jbd/ch
eckpoint.c:363: "drop_count != 0 || cleanup_ret != 0"
Oct 31 15:26:33 up kernel: ------------[ cut here ]------------
Oct 31 15:26:33 up kernel: kernel BUG at fs/jbd/checkpoint.c:363!
Oct 31 15:26:33 up kernel: invalid operand: 0000 [#1]
Oct 31 15:26:33 up kernel: SMP 
Oct 31 15:26:33 up kernel: Modules linked in: md5 ipv6 i2c_dev i2c_core ipmi_dev
intf ipmi_si ipmi_msghandler ipt_REJECT ipt_state ip_conntrack iptable_filter ip
_tables button battery ac joydev uhci_hcd ehci_hcd hw_random e1000 floppy sg dm_
snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi
_mod
Oct 31 15:26:33 up kernel: CPU:    2
Oct 31 15:26:33 up kernel: EIP:    0060:[<f8880fc0>]    Not tainted VLI
Oct 31 15:26:33 up kernel: EFLAGS: 00010216   (2.6.9-42.0.3.ELsmp) 
Oct 31 15:26:33 up kernel: EIP is at log_do_checkpoint+0x10e/0x148 [jbd]
Oct 31 15:26:33 up kernel: eax: 0000006e   ebx: f13df2fc   ecx: d3a29d84   edx: 
f8884dac
Oct 31 15:26:33 up kernel: esi: f7e0de00   edi: f7091200   ebp: f13df2fc   esp: 
d3a29d80
Oct 31 15:26:33 up kernel: ds: 007b   es: 007b   ss: 0068
Oct 31 15:26:33 up kernel: Process sendmail (pid: 5020, threadinfo=d3a29000 task
=cf870630)

I have seen this two days in a row as an automated system flushes its queue at
this mail server. This isn't quite the ability to reproduce it, but I'd be glad
to do anything you might need to better document the issue.

Comment 12 Eric Sandeen 2006-10-31 22:48:37 UTC
Bob, thanks for the report.  I'm thinking that the patch referenced in comment
#2 likely addresses the issue (if you hit the assertion a lot, we could verify
this) but it has KABI issues which must be addressed, and this is going to take
some time to work out.

Comment 13 Bob Plankers 2006-10-31 22:53:15 UTC
Thanks Eric. I understand the KABI issues and don't envy the work that needs to
happen to maintain that. Do you have any suggestions regarding a workaround in
the meantime? My only though was putting a mainstream kernel on the box, built
with roughly the same config as the 2.6.9-43.0.3, while the fix is pending.

Comment 14 Eric Sandeen 2006-10-31 23:01:01 UTC
I'm afraid that I do not have a good workaround for you at the moment.  I'll see
if I can work something out soon for this bug... looks like it's getting a bit
more popular.

Comment 15 Samuel Benjamin 2007-01-10 22:50:13 UTC
Need to move this one for 4.5 per Dell request. Set ACK review flags. 3 IT's
associated with this bug.

Comment 16 Eric Sandeen 2007-01-18 18:23:45 UTC
I've finally had some time to look at this one a bit more.

The patch in comment #2 looks pretty good from the perspective that since it
went upstream, it's had no further changes, and it's been there for a long time
for soaking.  No good ideas about the KABI issues yet though - adding the extra
struct member clobbers jbd KABI.

However, there seems to be another obvious bug that was fixed here:

http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.3104.1.2

  Fix a bug in list scanning that can cause us to skip the last buffer on the
  checkpoint list (and hence fail to do any progress under some rather
  unfavorable conditions).

in 2.6.10, which looks obviously correct & probably should go in, although I'm
not sure it will address all the problems folks are seeing here.

Comment 18 John Feeney 2007-01-31 23:33:19 UTC
Dell said it would test the fix when available. Currently, they are trying to 
create a test environment where the bug manifests in a more consistent manner. 

Comment 20 John Feeney 2007-02-08 17:10:22 UTC
I haven't heard of anything since I asked last week. Will ask again. 

Comment 21 John Feeney 2007-02-08 17:14:26 UTC
Charles, any word on getting this reproduced? It appears as though the lack 
of a reproducer is gating the resolution. Setting to Needinfo.

Comment 24 Charles Rose 2007-02-20 05:23:41 UTC
We are still exploring ways to reproduce this issue.

Comment 25 ray dejean 2007-02-21 21:29:33 UTC
Well i'm seeing the same problem here.  RHEL 4, kernel 2.6.9-42 and 2.6.9-42.0.8
also.  On IBM x360 server.  This server is also a mail server, running qpopper
pop and dovecot imap server for 15,000 students and 2,000 faculty/staff.  So yes
it gets pretty busy with I/O at times.  Disk storage is IBM DS4700.

This past week it's gotten really bad, crashing once or twice per day.  Bosses
are complaining, haha.  I'm going to try the small patch mentioned in comment
#16 to see if that helps.  Will post our results.

Comment 26 Eric Sandeen 2007-02-21 21:34:34 UTC
Ray, thanks.  I look forward to the results of your test...

Comment 27 ray dejean 2007-02-23 20:33:07 UTC
Well it's been up for 40 hours with no problems.  I'll be out all next week
snowboarding in Breckenridge.  So if it stays up for a week solid and i get no
calls on the mountain, that'll be the true test.

Comment 28 Eric Sandeen 2007-03-01 21:00:00 UTC
Ray, still going strong?

If anyone else is hitting this, can you please try the same patch?  If you need
a kernel rpm built w/ the change I'll gladly do that.

Comment 29 Eric Sandeen 2007-03-01 21:05:25 UTC
*** Bug 203847 has been marked as a duplicate of this bug. ***

Comment 30 ray dejean 2007-03-05 04:56:26 UTC
(In reply to comment #28)
> Ray, still going strong?
> 
> If anyone else is hitting this, can you please try the same patch?  If you need
> a kernel rpm built w/ the change I'll gladly do that.

Yep, still up after 10 days.  Below is a list of our crashes/reboots until i
patched 2.6.9-42.0.8.ELsmp on Feb 21.  I'd say this patch definitely helped.

<pre>
reboot   system boot  2.6.9-42.0.8.ELs Wed Feb 21 23:36         (10+23:11)
reboot   system boot  2.6.9-42.0.8.ELs Wed Feb 21 11:03          (12:29)
reboot   system boot  2.6.9-42.0.8.ELs Wed Feb 21 00:34          (22:59)
reboot   system boot  2.6.9-42.ELsmp   Tue Feb 20 14:31          (09:59)
reboot   system boot  2.6.9-42.ELsmp   Tue Feb 20 04:15          (20:14)
reboot   system boot  2.6.9-42.ELsmp   Mon Feb 19 07:30         (1+17:00)
reboot   system boot  2.6.9-42.ELsmp   Sun Feb 18 14:27         (2+10:03)
reboot   system boot  2.6.9-42.ELsmp   Fri Feb 16 12:37         (4+11:53)
reboot   system boot  2.6.9-42.ELsmp   Wed Feb 14 07:13         (6+17:17)
</pre>

Comment 32 Eric Sandeen 2007-03-05 22:17:22 UTC
Because it is an obviously correct fix (though at the time of filing, not
obviously the correct fix for -this- issue in particular), the patch in comment
#16 got its own bug,

  224638: jbd __cleanup_transaction skips last buffer on checkpoint list

It's been submitted for inclusion in a future RHEL4 kernel release.

Since it fixes at least one reporter's problems with this bug, dup'ing of one
bug to the other may be in order.

Comment 36 Peter Martuccelli 2007-03-21 14:53:47 UTC
Eric please post the patch for review.

Comment 37 Eric Sandeen 2007-03-21 15:08:59 UTC
Created attachment 150590 [details]
jbd patch

Peter, the patch was already sent internally, and ACKed, for bug 224638. 
Attaching here for completeness.

It was not clear at the time that it would resolve the issues in -this- bug,
because there was another possibility that might have been causing the problem,
so I created a new bug for this simple, obviously correct fix.	So far, though,
it looks like this patch *is* resolving this issue for the reporters who have
chimed in.  So, not sure what to do with this bug, administration-wise?

Comment 39 Samuel Benjamin 2007-03-23 02:15:34 UTC
Can engineering provide a test kernel or package as applicable so Dell can
verify if this problem has been resolved with this patch? Thanks.

Comment 43 Eric Sandeen 2007-03-30 20:52:33 UTC
I have placed 42.0.10 kernels with the added fix for bug 224638 (which seems to
have resolved this issue for some people) at:

http://people.redhat.com/esandeen/bz224638/

If anyone needs a kernel other than i686 and x86_64, smp, please let me know.

Note these kernels are only for testing, and are not official RHEL4 released
kernels.

Thanks,
-Eric

Comment 44 Suzanne Logcher 2007-04-11 22:11:34 UTC
Although this bugzilla was approved for RHEL 4.5, it remained in an unresolved
status and no patch was included in the 4.5 kernel.
Moved to 4.6.

Comment 45 Eric Sandeen 2007-04-11 22:17:46 UTC
Duping to bug 224638 for now, patch which seems to fix this issue was sent under
that bug.

*** This bug has been marked as a duplicate of 224638 ***