Description of problem:
I know this isn't supported, but I'm reporting it anyway, as it seems likely to be symptomatic of a general problem. I'm rebuilding anaconda and an FC4 tree with the latest updates integrated. This works fine with kernel 2.6.15-1.1833_FC4, but when I updated to 2.6.16-1.2069, the installer reliably hangs somewhere in the process of installing packages. The problem also occurs with 2.6.16-1.2096_FC4, but went away when I reverted to 2.6.15-1.1833_FC4.

The place where the hang happens varies from system to system and depending on options chosen, but is usually about 1/10 to 1/3 of the way through -- I haven't had an install complete. This is on a wide variety of hardware, ranging from a Pentium III SMP system (i686 kernel) to some new Opteron and Athlon64 machines. Happens in both text and GUI modes -- the system totally hangs, including the mouse and keyboard LEDs.

Version-Release number of selected component (if applicable):
kernel-2.6.16-1.2069_FC4 and kernel-2.6.16-1.2096_FC4

How reproducible:
always

Steps to Reproduce:
1. rebuild install tree with updated packages
2. rebuild anaconda against that tree
3. attempt install

Actual results:
freezes partway through install

Expected results:
no freeze. :)

Additional info:
I realize that you can't really reproduce these conditions here, but I wanted to mention this to you, because I'm pretty sure the system should never do this, and clearly the kernel update is a factor.
At first, I thought this was a recurrence of bug #172782. But that just results in an unkillable hung process; this is a whole hung system. Not only does that make it harder to diagnose, it seems likely that it's a different problem. Once, by running "top -d 0" in the shell on vt2, I did notice that the hang happened while groupadd was running. But another time, the rpm script seemed to be in the middle of scrollkeeper-update.
Tested with kernel-2.6.16-1.2107_FC4 -- still an issue.
Still happens reliably with kernel-2.6.16-1.2108_FC4. I'm going to see if I can reproduce the error in a more standard running environment.
Urgh. Can't reproduce outside of the anaconda environment -- everything apparently works fine there. However, it still locks up anaconda every time. Strange.
kernel-2.6.16-1.2114_FC4 doesn't help either.
kernel-2.6.17-1.2141_FC4, same deal. :)
Okay, so rather than just hoping this gets fixed or someone knows about it already, I've actually found some time to try to diagnose it. I think the issue is a locking problem in ext3. More soonish.
So, uh, the magic sysrq keys work fine. I just assumed that the system was so hung that they wouldn't. That was dumb of me. Seeing what I can find from that. In the meantime, I have verified that the problem *only* happens when installing onto ext3. ext2, reiserfs, and xfs are fine.
However, XFS does log this error, if I turn up the log level sufficiently:

    BUG: xfs_io/2545, lock held at task exit time!
    [da2b2da0] {init_once}
    .. held by:          xfs_io: 2545 [da2fe000, 118]
    ... acquired at:     freeze_bdev+0xc/0x5d
    end_request: I/O error, dev fd0, sector 0

I am currently fighting with the serial port on my thinkpad; hopefully I'll be able to actually get some useful diagnostic information here. I know FC4 is about to go away, but since the FC4 kernel is so close to the FC5 (and devel, yeah?) kernels, it still seems worth my time.
Created attachment 133166 [details] Magic SysRq t and p So, hopefully this is helpful....
Created attachment 133167 [details] more magic sysrq output from another run
Okay, so....... here's some amateur sleuthing. Poking around with sysrq-p revealed that it's stuck in a pretty small loop -- specifically, somewhere around 0x26e1 to 0x27f1 in jbd.o. The 0x27f1 appears to be the end part of a loop -- it tests and jumps back. So the relevant bits are (with the string "journal_commit_transaction" shortened to "j_c_t" to make reading easier in bugzilla, since it wants to wrap):

    26dc:  e9 10 01 00 00       jmp    27f1 <j_c_t+0x8c0>
    26e1:  8b 70 24             mov    0x24(%eax),%esi
    26e4:  8b 1e                mov    (%esi),%ebx
    26e6:  8b 03                mov    (%ebx),%eax
    26e8:  a8 04                test   $0x4,%al
    26ea:  74 31                je     271d <j_c_t+0x7ec>
    26ec:  ba 26 01 00 00       mov    $0x126,%edx
    26f1:  b8 62 07 00 00       mov    $0x762,%eax
    26f6:  e8 fc ff ff ff       call   26f7 <j_c_t+0x7c6>
    26fb:  e8 fc ff ff ff       call   26fc <j_c_t+0x7cb>
    2700:  8b 03                mov    (%ebx),%eax
    2702:  a8 04                test   $0x4,%al
    2704:  75 0b                jne    2711 <j_c_t+0x7e0>
    2706:  8b 43 30             mov    0x30(%ebx),%eax
    2709:  85 c0                test   %eax,%eax
    270b:  0f 85 e0 00 00 00    jne    27f1 <j_c_t+0x8c0>
    2711:  89 d8                mov    %ebx,%eax
    2713:  e8 fc ff ff ff       call   2714 <j_c_t+0x7e3>
    2718:  e9 d4 00 00 00       jmp    27f1 <j_c_t+0x8c0>

    [stuff which appears to never be reached]

    27f1:  8b 0c 24             mov    (%esp),%ecx
    27f4:  8b 41 2c             mov    0x2c(%ecx),%eax
    27f7:  85 c0                test   %eax,%eax
    27f9:  0f 85 e2 fe ff ff    jne    26e1 <j_c_t+0x7b0>

Which I *believe* corresponds to fs/jbd/commit.c, line 618:

    wait_for_iobuf:
            while (commit_transaction->t_iobuf_list != NULL) {
                    struct buffer_head *bh;

                    jh = commit_transaction->t_iobuf_list->b_tprev;
                    bh = jh2bh(jh);
                    if (buffer_locked(bh)) {
                            wait_on_buffer(bh);
                            goto wait_for_iobuf;
                    }

although it's possible it's the "wait_for_ctlbuf:" section at line 674, which is very similar.

It goes around and around in this loop forever. Most of the operations here appear to be trivial -- I don't understand enough to figure out why buffer_locked(bh) always comes back true, but it seems to always take that branch and the goto back to the start. (And by the way, that buffer_##name thing confused the heck out of me -- see the sketch below.)

I notice that CONFIG_PREEMPT_VOLUNTARY=y in 2.6.17-1.2143_FC4, but CONFIG_PREEMPT_NONE=y in the last working kernel (2.6.15-1.1833_FC4) -- perhaps that is to blame?
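For anyone else tripped up by that buffer_##name thing: buffer_locked() isn't defined anywhere you can grep for directly, because it's generated by a token-pasting macro in include/linux/buffer_head.h. From memory, the pattern is roughly this (a paraphrased sketch, not the verbatim header):

    /* Rough shape of the BUFFER_FNS macro from include/linux/buffer_head.h.
     * Paraphrased from memory; see the real header for the exact text. */
    #define BUFFER_FNS(bit, name)                                           \
    static inline void set_buffer_##name(struct buffer_head *bh)           \
    {                                                                       \
            set_bit(BH_##bit, &(bh)->b_state);                              \
    }                                                                       \
    static inline void clear_buffer_##name(struct buffer_head *bh)         \
    {                                                                       \
            clear_bit(BH_##bit, &(bh)->b_state);                            \
    }                                                                       \
    static inline int buffer_##name(const struct buffer_head *bh)          \
    {                                                                       \
            return test_bit(BH_##bit, &(bh)->b_state);                      \
    }

So BUFFER_FNS(Lock, locked) is what expands into the buffer_locked() used in the loop above -- it just tests the BH_Lock bit in bh->b_state.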
Drat. I got all excited about the possibility that it was the preemption, but when I rebuilt 2.6.17-1.2143_FC4 with preemption off, I'm _still_ getting the hang. I guess my next test here is to try with the devel tree kernel, 'cause FC4 is about to go away.
Ahhhh, this looks familiar: http://groups.google.com/group/fa.linux.kernel/browse_thread/thread/73657d0fb6a7cd38/7c854eedb01988b7?tvc=2 And it explains why it was only getting triggered in the installer.
And it looks like patch-2.6.18-rc2 in the current tree includes the fix suggested there. I'm going to try it (on Monday...), but presumably that means I can mark this as fixed in Rawhide.
Specifically, this from the changelog. Now I need to go figure out git so I can pull the actual specific patch. :)

    commit e7b384043e27bed4f23b108481b99c518dd01a01
    Author: Andrew Morton <akpm>
    Date:   Fri Jun 30 01:56:00 2006 -0700

        [PATCH] cond_resched() fix

        Fix a bug identified by Zou Nan hai <nanhai.zou>:

        If the system is in state SYSTEM_BOOTING, and need_resched() is
        true, cond_resched() returns true even though it didn't reschedule.
        Consequently need_resched() remains true and JBD locks up.

        Fix that by teaching cond_resched() to only return true if it
        really did call schedule().

        cond_resched_lock() and cond_resched_softirq() have a problem too.
        If we're in SYSTEM_BOOTING state and need_resched() is true, these
        functions will drop the lock and will then try to call schedule(),
        but the SYSTEM_BOOTING state will prevent schedule() from being
        called.  So on return, need_resched() will still be true, but
        cond_resched_lock() has to return 1 to tell the caller that the
        lock was dropped.  The caller will probably lock up.

        Bottom line: if these functions dropped the lock, they _must_ call
        schedule() to clear need_resched().  Make it so.

        Also, uninline __cond_resched().  It's largeish, and slowpath.

        Acked-by: Ingo Molnar <mingo>
        Cc: <stable>
        Signed-off-by: Andrew Morton <akpm>
        Signed-off-by: Linus Torvalds <torvalds>
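To convince myself I understand the failure mode, here's a tiny user-space model of the livelock (entirely my own simplification -- not kernel code, and all the names are invented for illustration):

    /* Toy model of the cond_resched()-during-boot livelock described in
     * the changelog above.  My own simplification; not kernel code. */
    #include <stdbool.h>
    #include <stdio.h>

    static bool need_resched_flag = true;  /* stands in for need_resched() */
    static bool booting = true;            /* system_state == SYSTEM_BOOTING */

    /* Buggy shape: returns 1 ("I rescheduled") whenever a resched was
     * pending, even though the boot-state check stopped the actual
     * schedule() call -- so the pending flag is never cleared. */
    static int buggy_cond_resched(void)
    {
            if (need_resched_flag) {
                    if (!booting)
                            need_resched_flag = false;  /* schedule() clears it */
                    return 1;
            }
            return 0;
    }

    /* Fixed shape: only claims success when it really "scheduled". */
    static int fixed_cond_resched(void)
    {
            if (need_resched_flag && !booting) {
                    need_resched_flag = false;
                    return 1;
            }
            return 0;
    }

    int main(void)
    {
            int spins = 0;

            /* A caller shaped like JBD's commit loop: while a resched is
             * pending, yield and retry.  With the buggy version this spins
             * forever during boot (capped here so the demo terminates). */
            while (need_resched_flag && buggy_cond_resched() && ++spins < 10)
                    ;
            printf("buggy: %d futile yields, flag never clears (livelock)\n", spins);

            while (need_resched_flag && fixed_cond_resched())
                    ;
            printf("fixed: loop exits as soon as yielding is impossible\n");
            return 0;
    }

That matches what I saw: the jbd loop keeps being told a reschedule happened, need_resched() stays true, and it goes around forever.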
Reassigning to our new ext3 front man, since it appears to be an ext3 issue. I believe this is the patch you're after: http://lkml.org/lkml/diff/2006/6/28/83/1
Yes, that's the one. (I did manage to figure out how to pull a given commit from git; turns out it's very easy once you know you're looking for "git-show" -- in this case, "git show e7b384043e27bed4f23b108481b99c518dd01a01".) It's also worth mentioning here that Dave Jones noted on fedora-devel-list that he's going to check with the stable kernel maintainers about getting this fix into 2.6.17.8 upstream.
This fix is now in upstream & rawhide, and should make it to fedora...
Thanks Eric. I was hoping that a final FC4 update would come out and contain this fix, but oh well.
I'll ping Dave & see if it can still happen, not sure :)
I think he was expecting 2.6.17.8 by now. Not sure how upstream took the suggestion to include this fix in that release. Thanks for checking -- I really appreciate it.
Cool -- looks like this did get into 2.6.17.8. http://lwn.net/Articles/194345/ Thanks Dave!