189812 – repeatable ext3 hang on 2.6.16-1.2069 (and+); works fine with ext2 or with earlier kernels

Bug 189812 - repeatable ext3 hang on 2.6.16-1.2069 (and+); works fine with ext2 or with earlier kernels

Summary: repeatable ext3 hang on 2.6.16-1.2069 (and+); works fine with ext2 or with ea...

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Eric Sandeen
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2006-04-24 20:28 UTC by Matthew Miller
Modified:	2007-11-30 22:11 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-08-04 20:15:15 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
Magic SysRq t and p (17.49 KB, text/plain) 2006-07-27 16:30 UTC, Matthew Miller	no flags	Details
more magic sysrq output from another run (18.39 KB, text/plain) 2006-07-27 16:31 UTC, Matthew Miller	no flags	Details
View All

Description Matthew Miller 2006-04-24 20:28:26 UTC

Description of problem:

I know this isn't supported, but I'm reporting it anyway as it seems like it's
probably symptomatic of a general problem. I'm rebuilding anaconda and an FC4
tree with the latest updates integrated. This works fine with kernel
2.6.15-1.1833_FC4, but when I updated to 2.6.16-1.2069, the installer reliably
hangs somewhere in the process of installing packages. The problem also occurs
with 2.6.16-1.2096_FC4, but went away when I reverted to 2.6.15-1.1833_FC4.

The place where the hang happens varies from system to system and depending on
options chosen, but is usually about 1/10 to 1/3 of the way though -- I haven't
had an install complete. This is on a wide variety of hardware, ranging from a
Pentium III SMP system (i686 kernel) to some new Opteron and Athlon64 machines. 

Happens in both text and GUI modes -- the system totally hangs, including the
mouse and keyboard LEDs.

Version-Release number of selected component (if applicable):

2.6.16-1.2069_FC4 and kernel-2.6.16-1.2096_FC4


How reproducible:

always

Steps to Reproduce:
1. rebuild install tree with updated packages
2. rebuild anaconda against that tree
3. attempt install
  
Actual results:

freezes partway through install


Expected results:

no freeze. :)

Additional info:

I realize that you can't really reproduce these conditions here, but I wanted to
mention this to you, because I'm pretty sure the system should never do this,
and clearly the kernel update is a factor.

Comment 1 Matthew Miller 2006-04-24 20:33:55 UTC

At first, I thought this was a reocurrence of bug #172782.

But, that just results in an unkillable hung process. This is a whole hung
system. Not only does that make it harder to diagonse, it seems likely that it's
a different thing. Once, by running "top -d 0" in the shell on vt2, I did notice
that the hang happened while groupadd was running. But another time, the rpm
script seemed to be in the middle of scrollkeeper-update.

Comment 2 Matthew Miller 2006-05-03 19:50:37 UTC

Tested with kernel-2.6.16-1.2107_FC4 -- still an issue.

Comment 3 Matthew Miller 2006-05-08 19:03:49 UTC

Still happens reliably with kernel-2.6.16-1.2108_FC4

I'm going to see if I can reproduce the error in a more-standard running
environment.

Comment 4 Matthew Miller 2006-06-06 14:38:12 UTC

Urgh. Can't reproduce outside of the anaconda environment -- works apparently
fine. However, still locks up anaconda every time. Strange.

Comment 5 Matthew Miller 2006-06-08 20:25:57 UTC

kernel-2.6.16-1.2114_FC4 doesn't help either.

Comment 6 Matthew Miller 2006-07-07 02:06:57 UTC

kernel-2.6.17-1.2141_FC4, same deal. :)

Comment 7 Matthew Miller 2006-07-26 17:44:37 UTC

Okay, so rather than just hoping this gets fixed or someone knows about it
already, I've actually found some time to try to diagnose it. I think the issue
is a locking problem in ext3. More soonish.

Comment 8 Matthew Miller 2006-07-26 18:51:02 UTC

So, uh, the magic sysrq keys work fine. I just assumed that the system was so
hung that they wouldn't. That was dumb of me. Seeing what I can find from that.

In the meantime, I have verified that the problem *only* happens when installing
onto ext3. ext2, reiserfs, and xfs are fine.

Comment 9 Matthew Miller 2006-07-26 20:57:32 UTC

However, XFS does log this error, if I turn up the log level sufficiently:

BUG: xfs_io/2545, lock held at task exit time!
 [da2b2da0] {init_once}
.. held by:            xfs_io: 2545 [da2fe000, 118]
... acquired at:               freeze_bdev+0xc/0x5d
end_request: I/O error, dev fd0, sector 0


I am currently fighting with the serial port on my thinkpad; hopefully I'll be
able to actually get some useful diagnostic information here. I know FC4 is
about to go away, but since the FC4 kernel is so close to the FC5 (and devel,
yeah?) kernels, it still seems worth my time.

Comment 10 Matthew Miller 2006-07-27 16:30:03 UTC

Created attachment 133166 [details]
Magic SysRq t and p

So, hopefully this is helpful....

Comment 11 Matthew Miller 2006-07-27 16:31:48 UTC

Created attachment 133167 [details]
more magic sysrq output from another run

Comment 12 Matthew Miller 2006-07-27 20:36:14 UTC

Okay, so....... here's some amateur sleuthing.

Poking around with sysrq-p revealed that it's stuck in a pretty small loop.
Specifically, somewhere around 0x26e1 to 0x27f1 in jbd.o. The 0x27f1 appears to
be the end part of a loop -- it tests and jumps back. So the relevant bits are
(with the string "journal_commit_transaction" shortened to "j_c_t" to make
reading easier in bugzilla, since it wants to wrap.)

    26dc:       e9 10 01 00 00          jmp    27f1 <j_c_t+0x8c0>
    26e1:       8b 70 24                mov    0x24(%eax),%esi
    26e4:       8b 1e                   mov    (%esi),%ebx
    26e6:       8b 03                   mov    (%ebx),%eax
    26e8:       a8 04                   test   $0x4,%al
    26ea:       74 31                   je     271d <j_c_t+0x7ec>
    26ec:       ba 26 01 00 00          mov    $0x126,%edx
    26f1:       b8 62 07 00 00          mov    $0x762,%eax
    26f6:       e8 fc ff ff ff          call   26f7 <j_c_t+0x7c6>
    26fb:       e8 fc ff ff ff          call   26fc <j_c_t+0x7cb>
    2700:       8b 03                   mov    (%ebx),%eax
    2702:       a8 04                   test   $0x4,%al
    2704:       75 0b                   jne    2711 <j_c_t+0x7e0>
    2706:       8b 43 30                mov    0x30(%ebx),%eax
    2709:       85 c0                   test   %eax,%eax
    270b:       0f 85 e0 00 00 00       jne    27f1 <j_c_t+0x8c0>
    2711:       89 d8                   mov    %ebx,%eax
    2713:       e8 fc ff ff ff          call   2714 <j_c_t+0x7e3>
    2718:       e9 d4 00 00 00          jmp    27f1 <j_c_t+0x8c0>

    [stuff which appears to never be reached]

    27f1:       8b 0c 24                mov    (%esp),%ecx
    27f4:       8b 41 2c                mov    0x2c(%ecx),%eax
    27f7:       85 c0                   test   %eax,%eax
    27f9:       0f 85 e2 fe ff ff       jne    26e1 <j_c_t+0x7b0>

Which I *believe* corresponds to fs/jbd/commit.c, line 618:

  wait_for_iobuf:
          while (commit_transaction->t_iobuf_list != NULL) {
                  struct buffer_head *bh;
  
                  jh = commit_transaction->t_iobuf_list->b_tprev;
                  bh = jh2bh(jh);
                  if (buffer_locked(bh)) {    
                          wait_on_buffer(bh); 
                          goto wait_for_iobuf;
                  }

although it's possible it's the "wait_for_ctlbuf:" section at line 674, which is
very similar.

It goes around and around in this loop forever. Most of the operations here
appear to be trivial -- I don't understand enough to figure out why
buffer_locked(bh) always comes back to being true, but it seems to always take
that branch and the goto back to the start. (And by the way, that  buffer_##name
thing confused the heck out of me.)

I notice that "CONFIG_PREEMPT_VOLUNTARY=y" in 2.6.17-1.2143_FC4, but
"CONFIG_PREEMPT_NONE=y" in the last working kernel (2.6.15-1.1833_FC4) --
perhaps that is to blame?

Comment 13 Matthew Miller 2006-07-28 14:52:47 UTC

Drat. I got all excited about the possibility that it was the preemption, but
when I rebuilt 2.6.17-1.2143_FC4 with preemption off, I'm _still_ getting the hang.

I guess my next test here is to try with the devel tree kernel, 'cause FC4 is
about to go away.

Comment 14 Matthew Miller 2006-07-28 16:58:29 UTC

Ahhhh, this looks familiar:

http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/73657d0fb6a7cd38/7c854eedb01988b7?tvc=2

And it explains why it was only getting triggered in the installer.

Comment 15 Matthew Miller 2006-07-28 17:06:30 UTC

And looks like the patch-2.6.18-rc2 in the current tree includes the fix
suggested there. I'm going to try it (on Monday...) but presumably that means I
can mark this as fixed in Rawhide.

Comment 16 Matthew Miller 2006-07-29 01:44:05 UTC

Specifically, this from the changelog. Now I need to go figure out git so I can
pull the actual specific patch. :)

commit e7b384043e27bed4f23b108481b99c518dd01a01
Author: Andrew Morton <akpm>
Date:   Fri Jun 30 01:56:00 2006 -0700

    [PATCH] cond_resched() fix
    
    Fix a bug identified by Zou Nan hai <nanhai.zou>:
    
    If the system is in state SYSTEM_BOOTING, and need_resched() is true,
    cond_resched() returns true even though it didn't reschedule.  Consequently
    need_resched() remains true and JBD locks up.
    
    Fix that by teaching cond_resched() to only return true if it really did call
    schedule().
    
    cond_resched_lock() and cond_resched_softirq() have a problem too.  If we're
    in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
    the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
    will prevent schedule() from being called.  So on return, need_resched() will
    still be true, but cond_resched_lock() has to return 1 to tell the caller that
    the lock was dropped.  The caller will probably lock up.
    
    Bottom line: if these functions dropped the lock, they _must_ call schedule()
    to clear need_resched().   Make it so.
    
    Also, uninline __cond_resched().  It's largeish, and slowpath.
    
    Acked-by: Ingo Molnar <mingo>
    Cc: <stable>
    Signed-off-by: Andrew Morton <akpm>
    Signed-off-by: Linus Torvalds <torvalds>

Comment 17 Jarod Wilson 2006-07-31 18:30:07 UTC

Reassigning to our new ext3 front man, since it appears to be an ext3 issue. I
believe this is the patch you're after:

http://lkml.org/lkml/diff/2006/6/28/83/1

Comment 18 Matthew Miller 2006-07-31 18:48:06 UTC

Yes, that's the one. (I did manage to figure out how to pull a given commit from
git; turns out it's very easy once you know you're looking for "git-show".)

It's also worth mentioning here that Dave Jones noted on fedora-devel-list that
he's going to check with the stable kernel maintainers about getting this fix
into 2.6.17.8 upstream.

Comment 19 Eric Sandeen 2006-08-04 20:15:15 UTC

This fix is now in upstream & rawhide, and should make it to fedora...

Comment 20 Matthew Miller 2006-08-05 00:23:42 UTC

Thanks Eric. I was hoping that a final FC4 update would come out and contain
this fix, but oh well.

Comment 21 Eric Sandeen 2006-08-05 02:26:18 UTC

I'll ping Dave & see if it can still happen, not sure :)

Comment 22 Matthew Miller 2006-08-05 02:29:17 UTC

I think he was expecting 2.6.17.8 by now. Not sure how the suggestion to
upstream to include this fix in that was taken. Thanks for checking -- I really
appreciate it.

Comment 23 Matthew Miller 2006-08-07 21:04:37 UTC

Cool -- looks like this did get into 2.6.17.8. http://lwn.net/Articles/194345/

Thanks Dave!

Note You need to log in before you can comment on or make changes to this bug.