Bug 158413

Summary: (busted vdso) i686 SMP kernel stuck during boot, UP works
Product: [Fedora] Fedora
Component: kernel
Version: rawhide
Hardware: All
OS: Linux
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Reporter: Warren Togami <wtogami>
Assignee: Dave Jones <davej>
QA Contact: Brian Brock <bbrock>
CC: gjunk, jspaleta, pfrields, rbh00, roland, sbruno, tech-fedora-bugzilla, tsukahara.ken, wtogami
Doc Type: Bug Fix
Last Closed: 2005-05-28 01:23:49 UTC
Bug Blocks: 136450
Attachments (ID: description, flags):
114745: contents of /proc/cpuinfo for i686 smp system using 1315 kernel (flags: none)
114746: contents of /proc/cpuinfo for i686 smp system using 1315 kernel (flags: none)
114757: /proc/cpuinfo - 2.6.11-1.1319_FC4smp - machine boots fine (flags: none)
114760: /proc/cpuinfo for P4 w/HT -- can't boot 1340 (flags: none)
114761: Picture of end of Alt-Sysrq-T when hung (flags: none)
114810: SysRQ Show State when it gets stuck (flags: none)

Description Warren Togami 2005-05-22 03:42:33 UTC
Description of problem:
Dual CPU Opteron with FC4 32bit installed.  Bootup gets stuck at:

EXT3-fs: mounted filesystem with ordered data mode.
Switching to new root
unmounting old /proc
unmounting old /sys
cfq: depth 4 reached, tagging now on

CFQ is not at fault, because elevator=deadline gets stuck there too, without the
CFQ message of course.

The hang appears to come right after the initrd's "init" script, at the point
where SELinux seems to be loaded.  The UP kernel gets past this point and prints
messages like these:

security:  3 users, 6 roles, 760 types, 87 bools
security:  55 classes, 170468 rules
SELinux:  Completing initialization.

Disabling selinux in /etc/sysconfig/selinux or booting with maxcpus=1 makes no
difference.

Version-Release number of selected component (if applicable):
WORKING kernel-2.6.11-1.1253_FC4smp
WORKING kernel-2.6.11-1.1267_FC4smp
WORKING kernel-2.6.11-1.1268_FC4smp
  (broke somewhere here, no builds available in between)
BROKEN  kernel-2.6.11-1.1275_FC4smp
BROKEN  kernel-2.6.11-1.1276_FC4smp
BROKEN  kernel-2.6.11-1.1286_FC4smp
BROKEN  kernel-2.6.11-1.1323_FC4smp
BROKEN  kernel-2.6.11-1.1337_FC4smp

Hardware
========
Tyan motherboard
2x1.4GHz Opteron
Adaptec I2O with i2o_block driver
Bug #158410 mentions similar behavior with a SATA controller on dual Opteron.

Comment 2 Warren Togami 2005-05-22 04:42:52 UTC
x86_64 FC4 on the same hardware does not have this SMP kernel problem.

What we don't know, however, is whether plain i686 SMP hardware is affected by
this problem.  If it is, we should fix this before FC4.  If it is unaffected,
then this shouldn't be a blocker.

Comment 3 Sean Bruno 2005-05-22 16:31:45 UTC
I was able to boot the kernel from
http://people.redhat.com/wtogami/temp/kernel-smp-2.6.11-1.1267_FC4.i686.rpm on a
dual Opteron system built on the ASUS K8N-DL with model 246 Opterons.  However,
the Broadcom NetXtreme Ethernet controller (tg3) seems to have an issue, as I am
unable to get on the network with this kernel.

If I boot off of the uniprocessor kernel from FC4T3 or any rawhide update, the
ethernet controller works just fine.

Comment 4 Warren Togami 2005-05-23 05:03:13 UTC
*** Bug 157691 has been marked as a duplicate of this bug. ***

Comment 5 Warren Togami 2005-05-23 05:06:05 UTC
Bug 157691 confirms that this is a general i686 SMP problem that affects both
32bit AMD64 and Pentium 4/Xeon.  We should try to avoid releasing FC4 with this
problem.

Comment 6 Warren Togami 2005-05-23 05:13:46 UTC
*** Bug 156664 has been marked as a duplicate of this bug. ***

Comment 7 Warren Togami 2005-05-23 05:19:09 UTC
Gene, in Bug 156664 #c3 you mention that the SMP kernel successfully boots on a
dual Pentium4 without HT?  Could you please attach a text file containing
/proc/cpuinfo from that machine?

Comment 8 Warren Togami 2005-05-23 05:33:28 UTC
It would help if somebody with a serial console could do the following procedure:

1) Apply the below patch to /sbin/mkinitrd script.

--- mkinitrd.orig       2005-05-22 19:28:32.000000000 -1000
+++ mkinitrd    2005-05-22 19:29:22.000000000 -1000
@@ -749,6 +749,8 @@
   echo "echo Mounting root filesystem" >> $RCFILE
   echo "mount -o $rootopts --ro -t $rootfs $rootdev /sysroot" >> $RCFILE

+  echo "echo Enabling Magic SysRQ" >> $RCFILE
+  echo "echo echo 1 > /proc/sys/kernel/sysrq" >> $RCFILE
   echo "echo Switching to new root" >> $RCFILE
   if [ -n "$UDEV_KEEP_DEV" ]; then
     echo "switchroot --movedev /sysroot" >> $RCFILE

2) Create a new initrd image for the latest SMP kernel.  Make a backup of the
existing initrd first in case something goes wrong.  Doing this would be
something like:
mv /boot/initrd-2.6.11-1.XXXX_FC4smp.img /boot/initrd-2.6.11-1.XXXX_FC4smp.img.backup
/sbin/mkinitrd /boot/initrd-2.6.11-1.XXXX_FC4smp.img 2.6.11-1.XXXX_FC4smp

3) Reboot using that new initrd.  When it gets stuck, hit ALT-SysRQ-T.  Save the
entire dump into a text file and attach it to this bug.

Comment 9 Warren Togami 2005-05-23 05:36:50 UTC
Oops... one too many echos.

--- mkinitrd.orig       2005-05-22 19:28:32.000000000 -1000
+++ mkinitrd    2005-05-22 19:37:04.000000000 -1000
@@ -749,6 +749,8 @@
   echo "echo Mounting root filesystem" >> $RCFILE
   echo "mount -o $rootopts --ro -t $rootfs $rootdev /sysroot" >> $RCFILE

+  echo "echo Enabling Magic SysRQ" >> $RCFILE
+  echo "echo 1 > /proc/sys/kernel/sysrq" >> $RCFILE
   echo "echo Switching to new root" >> $RCFILE
   if [ -n "$UDEV_KEEP_DEV" ]; then
     echo "switchroot --movedev /sysroot" >> $RCFILE
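
The difference between the two patches is easy to miss, so here is a small sketch of what each added line ends up putting into the generated script. It is only a demonstration: RCFILE below is a throwaway temp file (an assumption for illustration, not the real file mkinitrd writes), and the redirection is stripped before evaluation so nothing under /proc is touched.

```shell
#!/bin/sh
# Demonstration only: RCFILE is a temp file standing in for the generated
# init script; it is not the file mkinitrd really writes.
RCFILE=$(mktemp)

# Comment 8 version: the generated line contains two "echo"s.
echo "echo echo 1 > /proc/sys/kernel/sysrq" >> "$RCFILE"
# Comment 9 version: the generated line contains one "echo".
echo "echo 1 > /proc/sys/kernel/sysrq" >> "$RCFILE"

buggy_line=$(sed -n 1p "$RCFILE")
fixed_line=$(sed -n 2p "$RCFILE")

# Work out what each generated line would send to the sysrq file.
# "${line% > *}" drops the " > /proc/..." part so only the echo runs.
buggy_value=$(eval "${buggy_line% > *}")
fixed_value=$(eval "${fixed_line% > *}")

printf 'buggy line: %s -> writes "%s"\n' "$buggy_line" "$buggy_value"
printf 'fixed line: %s -> writes "%s"\n' "$fixed_line" "$fixed_value"
rm -f "$RCFILE"
```

With the first patch the init script would write the string "echo 1" rather than "1", which is why the second patch drops one echo.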


Comment 10 Warren Togami 2005-05-23 20:33:34 UTC
If your i686 SMP boots with the FC4 smp kernel, please submit your /proc/cpuinfo
in an attachment.  If you lock up during boot, please attach alt-sysrq-T as
indicated in Comment #8 and #9 and /proc/cpuinfo.
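
For anyone gathering this on a machine that does boot, a minimal sketch along these lines would put the kernel version and /proc/cpuinfo into one text file ready to attach. The output file name is a hypothetical choice, not something this bug requires.

```shell
#!/bin/sh
# Sketch: collect the requested details into a single text file.
# The report path is a hypothetical name chosen for this example.
report=${TMPDIR:-/tmp}/smp-boot-report.txt

{
    echo "== uname -a =="
    uname -a
    echo "== /proc/cpuinfo =="
    if [ -r /proc/cpuinfo ]; then
        cat /proc/cpuinfo
    else
        echo "(not available on this system)"
    fi
} > "$report"

echo "Wrote $report"
```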

Comment 11 Jef Spaleta 2005-05-23 20:49:20 UTC
Created attachment 114745 [details]
contents of /proc/cpuinfo for i686 smp system using 1315 kernel

I have an smp i686 machine booting with 1315 rawhide smp kernel.
I'll try booting into 1340 as soon as I'm physically at the machine again.

uname -a
Linux local.localdomain 2.6.11-1.1315_FC4smp #1 SMP Mon May 16 17:14:20 EDT
2005 i686 athlon i386 GNU/Linux

uptime
 16:47:30 up 2 days, 21:08,  3 users,  load average: 0.04, 0.05, 0.07

attached is the output of /proc/cpuinfo

Comment 12 Jef Spaleta 2005-05-23 20:49:46 UTC
Created attachment 114746 [details]
contents of /proc/cpuinfo for i686 smp system using 1315 kernel

I have an smp i686 machine booting with 1315 rawhide smp kernel.
I'll try booting into 1340 as soon as I'm physically at the machine again.

uname -a
Linux local.localdomain 2.6.11-1.1315_FC4smp #1 SMP Mon May 16 17:14:20 EDT
2005 i686 athlon i386 GNU/Linux

uptime
 16:47:30 up 2 days, 21:08,  3 users,  load average: 0.04, 0.05, 0.07

attached is the output of /proc/cpuinfo

Comment 13 gene c 2005-05-24 01:05:50 UTC
Created attachment 114757 [details]
/proc/cpuinfo - 2.6.11-1.1319_FC4smp - machine boots fine

Comment 14 Jef Spaleta 2005-05-24 01:09:19 UTC
(In reply to comment #11)
> I have an smp i686 machine booting with 1315 rawhide smp kernel.
> I'll try booting into 1340 as soon as i'm physically at the machine again.

Sorry about the double comment earlier. Booted the i686 smp machine into 1340 smp
kernel.

I have selinux in permissive mode, but judging from the other comments in this
report so far, I don't think that should matter.

-jef

Comment 15 David Sklar 2005-05-24 02:38:02 UTC
Created attachment 114760 [details]
/proc/cpuinfo for P4 w/HT -- can't boot 1340

My i686 SMP (Dell GX280 with 1 P4 and HT turned on) hangs on booting with 1340
(and has since 1276). The last SMP kernel I successfully booted with was 1261
(but I haven't tried anything between 1261 and 1276). The UP kernels boot fine.
When the boot hangs (after the LVM message) I can't reboot with Ctrl-Alt-Del
(no serial console; USB keyboard is completely unresponsive, hitting caps
lock/num lock doesn't change keyboard lights). Upgraded to most recent Dell
BIOS (A05, from A04) with no change.

/proc/cpuinfo is attached.

Comment 16 gene c 2005-05-24 02:51:07 UTC
Created attachment 114761 [details]
Picture of end of Alt-Sysrq-T when hung

Same system as in my earlier report - HT single CPU - SATA disk.
Sorry, no serial console - I know it's not enough, but this is what was left on
screen when I did Alt-Sysrq-T when it was hung.

gene/

Comment 17 Warren Togami 2005-05-24 21:05:30 UTC
My current theory is that it is failing to boot only on "newer" i686 SMP
machines.  We need to find a common theme here.

Can you folks try rebuilding upstream vanilla 2.6.12-rc4-gitX using the SMP
config file from /boot/config-*?  We need to know if it is an upstream problem,
or something we added.
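
As a rough sketch of that rebuild, assuming an unpacked upstream source tree and the usual kernel make targets, the steps might look like the function below. The function name is hypothetical, and the install step needs root.

```shell
#!/bin/sh
# Sketch only; run from the top of an unpacked upstream kernel source tree.
# build_vanilla_smp is a hypothetical helper name.
build_vanilla_smp() {
    config=$1                      # the SMP config file copied from /boot
    cp "$config" .config || return 1
    make oldconfig || return 1     # asks about options that are new to this tree
    make || return 1               # builds the kernel image and the modules
    # The remaining steps need root:
    make modules_install && make install
}

# Example call (not run here):
#   build_vanilla_smp /boot/config-2.6.11-1.XXXX_FC4smp
```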


Comment 18 gene c 2005-05-25 02:17:10 UTC
I built 2.6.12-rc4-git8 using the config-2.6.11-1.1340_FC4smp config from /boot.
I had to comment out IPMI stuff as it gave compile errors.

Sweet - this kernel boots no problem at all.

Best regards,

gene

Comment 19 Warren Togami 2005-05-25 02:23:14 UTC
Created attachment 114810 [details]
SysRQ Show State when it gets stuck

Comment 20 Richard Hitt 2005-05-25 09:31:47 UTC
Hi again Warren.

I built and tested successfully.  Working backwards from git8, I found the same
problem Gene did (in git8, git7, git6, and git5); git4 built okay.  I booted git4
and verified with gkrellm that there appeared to be two CPUs.  I had also built
plain 2.6.12-rc4, so I tried booting that, and it too came up fine with two CPUs
showing.  Between Gene and me, we've tested rc4, rc4-git4, and rc4-git8.

Comment 21 Warren Togami 2005-05-25 10:42:19 UTC
Bingo!  I rebuilt 1355 after commenting out patches 810 (exec-shield) and 813
(vdso), and the smp kernel successfully booted.  Arjan suggested this may
indicate a busted vdso, so I tried 1355smp with "vdso=0" and it too successfully
booted.

Busted vdso?
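
When comparing results across machines, it helps to confirm that a given boot really had the workaround applied. A minimal sketch, assuming the parameter is spelled exactly vdso=0 as above (has_vdso_disabled is a hypothetical helper name):

```shell
#!/bin/sh
# Succeeds if the given kernel command line contains the word vdso=0.
# has_vdso_disabled is a hypothetical helper written for this example.
has_vdso_disabled() {
    for word in $1; do
        [ "$word" = "vdso=0" ] && return 0
    done
    return 1
}

# Usage on a running Linux system:
#   has_vdso_disabled "$(cat /proc/cmdline)" && echo "booted with vdso=0"
```

Matching whole words avoids a false positive from a longer parameter that merely contains the same text.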

Comment 22 David Sklar 2005-05-25 13:39:01 UTC
I was locking up (see Comment #15), but if I give vdso=0 to 1341smp (the latest
kernel yum finds right now), it boots just fine and /proc/cpuinfo shows me two
CPUs (which are really 1 P4 with HT on).


Comment 23 Roland McGrath 2005-05-25 23:10:04 UTC
I committed the one-liner change to execshield.patch, which needed the update
because of upstream changes.  Dave's next build hopefully wins.

@@ -21,9 +21,9 @@ diff -urNp --exclude-from=/home/davej/.e
 +	/*
 +	 * Push current_thread_info()->sysenter_return to the stack.
 +	 * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
-+	 * pushed above, and the word being pushed now:
++	 * pushed above; +8 corresponds to copy_thread's esp0 setting.
 +	 */
-+	pushl (TI_sysenter_return-THREAD_SIZE+4*4)(%esp)
++	pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
  /*
   * Load the potential sixth argument from user stack.
   * Careful about security.
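
To see why the missing +8 matters, the displacement can be worked through with plain arithmetic. This is a sketch under stated assumptions: the numeric values for THREAD_SIZE, the field offset, and the thread_info address are made up for illustration, and it reads the patch comment as saying that esp0 sits 8 bytes below the top of the kernel stack.

```shell
#!/bin/sh
# All numeric values are assumptions chosen for illustration, not taken
# from the kernel headers.
THREAD_SIZE=8192          # assumed kernel stack size in bytes
TI_sysenter_return=40     # assumed offset of the field inside thread_info
base=81920                # assumed address of thread_info (bottom of the stack)

# Reading of the patch comment: esp0 is 8 bytes below the top of the stack,
# and four 4-byte words have already been pushed before this pushl.
esp=$(( base + THREAD_SIZE - 8 - 4*4 ))

old_disp=$(( TI_sysenter_return - THREAD_SIZE + 4*4 ))
new_disp=$(( TI_sysenter_return - THREAD_SIZE + 8 + 4*4 ))

wanted=$(( base + TI_sysenter_return ))   # where the field really lives
old_addr=$(( esp + old_disp ))            # address read before the fix
new_addr=$(( esp + new_disp ))            # address read after the fix

echo "wanted=$wanted old=$old_addr new=$new_addr"
# prints: wanted=81960 old=81952 new=81960
```

Under these assumptions the old displacement reads 8 bytes short of the sysenter_return field, so a wrong value would be pushed as the return address; the +8 lines the read up with the field again.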


Comment 24 Jeremy Katz 2005-05-26 01:51:31 UTC
1363 works on my box that didn't work before.  Placing in MODIFIED.  If anyone
continues to have problems as of 1363_FC4, please reopen.

Comment 25 gene c 2005-05-27 03:55:15 UTC
Confirmed fixed for me too using 1363_FC4smp.

Thanks!


Comment 26 Dave Jones 2005-05-28 01:24:35 UTC
*** Bug 158816 has been marked as a duplicate of this bug. ***