Bug 158413 - (busted vdso) i686 SMP kernel stuck during boot, UP works
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All Linux
Priority: medium Severity: medium
Assigned To: Dave Jones
QA Contact: Brian Brock
Duplicates: 156664 157691 158816
Blocks: FC4Blocker
Reported: 2005-05-21 23:42 EDT by Warren Togami
Modified: 2015-01-04 17:19 EST (History)
CC List: 9 users

Doc Type: Bug Fix
Last Closed: 2005-05-27 21:23:49 EDT

Attachments
contents of /proc/cpuinfo for i686 smp system using 1315 kernel (823 bytes, text/plain)
2005-05-23 16:49 EDT, Jef Spaleta
contents of /proc/cpuinfo for i686 smp system using 1315 kernel (823 bytes, text/plain)
2005-05-23 16:49 EDT, Jef Spaleta
/proc/cpuinfo - 2.6.11-1.1319_FC4smp - machine boots fine (832 bytes, text/plain)
2005-05-23 21:05 EDT, gene c
/proc/cpuinfo for P4 w/HT -- can't boot 1340 (1.01 KB, text/plain)
2005-05-23 22:38 EDT, David Sklar
Picture of end of Alt-Sysrq-T when hung (1.74 MB, image/jpeg)
2005-05-23 22:51 EDT, gene c
SysRQ Show State when it gets stuck (13.25 KB, text/plain)
2005-05-24 22:23 EDT, Warren Togami

Description Warren Togami 2005-05-21 23:42:33 EDT
Description of problem:
Dual CPU Opteron with FC4 32bit installed.  Bootup gets stuck at:

EXT3-fs: mounted filesystem with ordered data mode.
Switching to new root
unmounting old /proc
unmounting old /sys
cfq: depth 4 reached, tagging now on

CFQ is not at fault, because elevator=deadline gets stuck there too, without the
CFQ message of course.

This point seems to be after the initrd's "init" script, where SELinux appears
to be loaded.  Booting the UP kernel gets past this point with these kinds of
messages:

security:  3 users, 6 roles, 760 types, 87 bools
security:  55 classes, 170468 rules
SELinux:  Completing initialization.

Disabling selinux in /etc/sysconfig/selinux or booting with maxcpus=1 makes no
difference.

Version-Release number of selected component (if applicable):
WORKING kernel-2.6.11-1.1253_FC4smp
WORKING kernel-2.6.11-1.1267_FC4smp
WORKING kernel-2.6.11-1.1268_FC4smp
  (broke somewhere here, no builds available in between)
BROKEN  kernel-2.6.11-1.1275_FC4smp
BROKEN  kernel-2.6.11-1.1276_FC4smp
BROKEN  kernel-2.6.11-1.1286_FC4smp
BROKEN  kernel-2.6.11-1.1323_FC4smp
BROKEN  kernel-2.6.11-1.1337_FC4smp
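
The list above narrows the regression to somewhere after build 1268 and no
later than build 1275.  A minimal sketch of that narrowing, assuming the
WORKING/BROKEN lines are fed in exactly the format used above; the function
name regression_window is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read "WORKING kernel-2.6.11-1.NNNN_FC4smp" and
# "BROKEN kernel-2.6.11-1.NNNN_FC4smp" lines on stdin, then print the highest
# working build number followed by the lowest broken build number.
regression_window() {
    awk '
        $1 == "WORKING" || $1 == "BROKEN" {
            # The build number sits between "-1." and "_FC4".
            n = $2
            sub(/^kernel-2\.6\.11-1\./, "", n)
            sub(/_FC4.*$/, "", n)
            n += 0
            if ($1 == "WORKING" && (good == "" || n > good)) good = n
            if ($1 == "BROKEN"  && (bad  == "" || n < bad))  bad  = n
        }
        END { print good, bad }
    '
}
```

Lines that start with neither WORKING nor BROKEN (such as the "broke
somewhere here" note) are ignored.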

Hardware
========
Tyan motherboard
2x1.4GHz Opteron
Adaptec I2O with i2o_block driver
Bug #158410 mentions similar behavior with a SATA controller on dual Opteron.
Comment 2 Warren Togami 2005-05-22 00:42:52 EDT
x86_64 FC4 on the same hardware does not have this SMP kernel problem.

What we don't know, however, is whether plain i686 SMP hardware is affected by
this problem.  If it is, we should fix this before FC4.  If it is unaffected,
then this shouldn't be a blocker.
Comment 3 Sean Bruno 2005-05-22 12:31:45 EDT
I was able to boot the kernel from
http://people.redhat.com/wtogami/temp/kernel-smp-2.6.11-1.1267_FC4.i686.rpm on a
dual Opteron system built on the ASUS K8N-DL with model 246 Opterons.  However,
the Broadcom NetXtreme Ethernet controller (tg3) seems to have an issue, as I am
unable to get on the network with this kernel.

If I boot off of the uniprocessor kernel from FC4T3 or any rawhide update, the
ethernet controller works just fine.
Comment 4 Warren Togami 2005-05-23 01:03:13 EDT
*** Bug 157691 has been marked as a duplicate of this bug. ***
Comment 5 Warren Togami 2005-05-23 01:06:05 EDT
Bug 157691 confirms that this is a general i686 SMP problem that affects both
32bit AMD64 and Pentium 4/Xeon.  We should try to avoid releasing FC4 with this
problem.
Comment 6 Warren Togami 2005-05-23 01:13:46 EDT
*** Bug 156664 has been marked as a duplicate of this bug. ***
Comment 7 Warren Togami 2005-05-23 01:19:09 EDT
Gene, in Bug 156664 #c3 you mention that the SMP kernel successfully boots on a
dual Pentium4 without HT?  Could you please attach a text file containing
/proc/cpuinfo from that machine?
Comment 8 Warren Togami 2005-05-23 01:33:28 EDT
It would help if somebody with a serial console could do the following procedure:

1) Apply the below patch to /sbin/mkinitrd script.

--- mkinitrd.orig       2005-05-22 19:28:32.000000000 -1000
+++ mkinitrd    2005-05-22 19:29:22.000000000 -1000
@@ -749,6 +749,8 @@
   echo "echo Mounting root filesystem" >> $RCFILE
   echo "mount -o $rootopts --ro -t $rootfs $rootdev /sysroot" >> $RCFILE

+  echo "echo Enabling Magic SysRQ" >> $RCFILE
+  echo "echo echo 1 > /proc/sys/kernel/sysrq" >> $RCFILE
   echo "echo Switching to new root" >> $RCFILE
   if [ -n "$UDEV_KEEP_DEV" ]; then
     echo "switchroot --movedev /sysroot" >> $RCFILE

2) Create a new initrd image for the latest SMP kernel.  Make a backup of the
existing initrd just in case you somehow screw it up.  Doing this would be
something like:
mv /boot/initrd-2.6.11-1.XXXX_FC4smp.img /boot/initrd-2.6.11-1.XXXX_FC4smp.img.backup
/sbin/mkinitrd /boot/initrd-2.6.11-1.XXXX_FC4smp.img 2.6.11-1.XXXX_FC4smp

3) Reboot using that new initrd.  When it gets stuck, hit ALT-SysRQ-T.  Save the
entire dump into a text file and attach it in this bug.
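
For step 2, a small dry-run sketch may help avoid typos in the long file
names.  It only prints the two commands for a given kernel version and runs
nothing; the function name initrd_rebuild_cmds is made up, and 1340 is used
purely as an example version:

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the backup and rebuild commands
# from step 2 for the kernel version passed as the first argument.
initrd_rebuild_cmds() {
    kver=$1
    img=/boot/initrd-$kver.img
    echo "mv $img $img.backup"
    echo "/sbin/mkinitrd $img $kver"
}

# Example: review the output, then run the two lines by hand as root.
initrd_rebuild_cmds 2.6.11-1.1340_FC4smp
```

If the new initrd turns out to be bad, the .backup file can be moved back
into place from the UP kernel or a rescue boot.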
Comment 9 Warren Togami 2005-05-23 01:36:50 EDT
Oops... one too many echos.

--- mkinitrd.orig       2005-05-22 19:28:32.000000000 -1000
+++ mkinitrd    2005-05-22 19:37:04.000000000 -1000
@@ -749,6 +749,8 @@
   echo "echo Mounting root filesystem" >> $RCFILE
   echo "mount -o $rootopts --ro -t $rootfs $rootdev /sysroot" >> $RCFILE

+  echo "echo Enabling Magic SysRQ" >> $RCFILE
+  echo "echo 1 > /proc/sys/kernel/sysrq" >> $RCFILE
   echo "echo Switching to new root" >> $RCFILE
   if [ -n "$UDEV_KEEP_DEV" ]; then
     echo "switchroot --movedev /sysroot" >> $RCFILE
Comment 10 Warren Togami 2005-05-23 16:33:34 EDT
If your i686 SMP boots with the FC4 smp kernel, please submit your /proc/cpuinfo
in an attachment.  If you lock up during boot, please attach alt-sysrq-T as
indicated in Comment #8 and #9 and /proc/cpuinfo.
Comment 11 Jef Spaleta 2005-05-23 16:49:20 EDT
Created attachment 114745 [details]
contents of /proc/cpuinfo for i686 smp system using 1315 kernel

I have an SMP i686 machine booting with the 1315 rawhide SMP kernel.
I'll try booting into 1340 as soon as I'm physically at the machine again.

uname -a
Linux local.localdomain 2.6.11-1.1315_FC4smp #1 SMP Mon May 16 17:14:20 EDT
2005 i686 athlon i386 GNU/Linux

uptime
 16:47:30 up 2 days, 21:08,  3 users,  load average: 0.04, 0.05, 0.07

attached is the output of /proc/cpuinfo
Comment 12 Jef Spaleta 2005-05-23 16:49:46 EDT
Created attachment 114746 [details]
contents of /proc/cpuinfo for i686 smp system using 1315 kernel

I have an SMP i686 machine booting with the 1315 rawhide SMP kernel.
I'll try booting into 1340 as soon as I'm physically at the machine again.

uname -a
Linux local.localdomain 2.6.11-1.1315_FC4smp #1 SMP Mon May 16 17:14:20 EDT
2005 i686 athlon i386 GNU/Linux

uptime
 16:47:30 up 2 days, 21:08,  3 users,  load average: 0.04, 0.05, 0.07

attached is the output of /proc/cpuinfo
Comment 13 gene c 2005-05-23 21:05:50 EDT
Created attachment 114757 [details]
/proc/cpuinfo - 2.6.11-1.1319_FC4smp - machine boots fine
Comment 14 Jef Spaleta 2005-05-23 21:09:19 EDT
(In reply to comment #11)
> I have an SMP i686 machine booting with the 1315 rawhide SMP kernel.
> I'll try booting into 1340 as soon as I'm physically at the machine again.

Sorry about the double comment earlier.  Booted the i686 SMP machine into the
1340 SMP kernel.

I have SELinux in permissive mode, but judging from the other comments in this
report so far, I don't think that should matter.

-jef
Comment 15 David Sklar 2005-05-23 22:38:02 EDT
Created attachment 114760 [details]
/proc/cpuinfo for P4 w/HT -- can't boot 1340

My i686 SMP (Dell GX280 with 1 P4 and HT turned on) hangs on booting with 1340
(and has since 1276). The last SMP kernel I successfully booted with was 1261
(but I haven't tried anything between 1261 and 1276). The UP kernels boot fine.
When the boot hangs (after the LVM message) I can't reboot with Ctrl-Alt-Del
(no serial console; USB keyboard is completely unresponsive, hitting caps
lock/num lock doesn't change keyboard lights). Upgraded to most recent Dell
BIOS (A05, from A04) with no change.

/proc/cpuinfo is attached.
Comment 16 gene c 2005-05-23 22:51:07 EDT
Created attachment 114761 [details]
Picture of end of   Alt-Sysrq-T when hung

Same system as my earlier report - HT single CPU - SATA disk.
Sorry, no serial console - I know it's not enough, but this is what was left on
screen when I did Alt-Sysrq-T when it was hung.

gene/
Comment 17 Warren Togami 2005-05-24 17:05:30 EDT
My current theory is that it is failing to boot only on "newer" i686 SMP
machines.  We need to find a common theme here.

Can you folks try rebuilding upstream vanilla 2.6.12-rc4-gitX using the SMP
config file from /boot/config-*?  We need to know if it is an upstream problem,
or something we added.
Comment 18 gene c 2005-05-24 22:17:10 EDT
I built 2.6.12-rc4-git8 using the config-2.6.11-1.1340_FC4smp config from /boot.
I had to comment out the IPMI options, as they gave compile errors.

Sweet - this kernel boots no problem at all.

Best regards,

gene
Comment 19 Warren Togami 2005-05-24 22:23:14 EDT
Created attachment 114810 [details]
SysRQ Show State when it gets stuck
Comment 20 Richard Hitt 2005-05-25 05:31:47 EDT
Hi again Warren.

I built and tested successfully.  Working backwards from git8, I hit the same
build problem Gene did in git8, git7, git6, and git5; git4 built okay.  I booted
git4 and verified with gkrellm that there appeared to be two CPUs.  I'd also
built plain 2.6.12-rc4, so I tried booting that, and it too came up fine with
two CPUs showing.  Between Gene and me we've tested rc4, rc4-git4, and rc4-git8.
Comment 21 Warren Togami 2005-05-25 06:42:19 EDT
Bingo!  I rebuilt 1355 after commenting out patches 810 (exec-shield) and 813
(vdso), and the SMP kernel booted successfully.  Arjan suggested this may
indicate a busted vdso, so I tried 1355smp with "vdso=0" and it booted
successfully too.

Busted vdso?
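
For anyone else trying the workaround: "vdso=0" goes on the kernel command
line.  A minimal sketch for confirming after boot that the option actually
reached the kernel, by looking for it as a whole word in /proc/cmdline; the
function name vdso_disabled is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the command line string passed as the first
# argument contains the exact word "vdso=0".
vdso_disabled() {
    case " $1 " in
        *" vdso=0 "*) return 0 ;;
    esac
    return 1
}

# Usage on a running system: /proc/cmdline holds the boot command line.
if vdso_disabled "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "vdso=0 is on the kernel command line"
else
    echo "vdso=0 is not set"
fi
```

Padding the string with spaces on both sides keeps the match from firing on
a longer option that merely contains "vdso=0" as a substring.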
Comment 22 David Sklar 2005-05-25 09:39:01 EDT
I was locking up (see Comment #15), but if I give vdso=0 to 1341smp (the latest
kernel yum finds right now), it boots just fine and /proc/cpuinfo shows me two
CPUs (which are really 1 P4 with HT on).
Comment 23 Roland McGrath 2005-05-25 19:10:04 EDT
I committed the one-liner change to execshield.patch, which needed the update
because of upstream changes.  Dave's next build hopefully wins.

@@ -21,9 +21,9 @@ diff -urNp --exclude-from=/home/davej/.e
 +	/*
 +	 * Push current_thread_info()->sysenter_return to the stack.
 +	 * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
-+	 * pushed above, and the word being pushed now:
++	 * pushed above; +8 corresponds to copy_thread's esp0 setting.
 +	 */
-+	pushl (TI_sysenter_return-THREAD_SIZE+4*4)(%esp)
++	pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
  /*
   * Load the potential sixth argument from user stack.
   * Careful about security.
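
To spell out the arithmetic in that comment as I read it: at the pushl, %esp
sits 8 bytes (copy_thread's esp0 setting) plus 4 words (the pushes above)
below the top of the kernel stack area, and thread_info starts THREAD_SIZE
below that top.  A sketch with made-up numbers follows; the THREAD_SIZE,
TI_sysenter_return and base values are illustrative assumptions, not read
from any kernel tree:

```shell
#!/bin/sh
# Illustrative values only (assumptions, not taken from kernel headers).
THREAD_SIZE=4096          # size of the stack + thread_info area
TI_sysenter_return=40     # byte offset of sysenter_return in thread_info
base=$((0x1234000))       # hypothetical thread_info address

# esp0 per the patch comment: 8 bytes below the top of the stack area.
esp0=$((base + THREAD_SIZE - 8))
# Four 4-byte words have been pushed before the pushl in question.
esp=$((esp0 - 4 * 4))

# Displacement in the fixed instruction, and in the old one.
disp_new=$((TI_sysenter_return - THREAD_SIZE + 8 + 4 * 4))
disp_old=$((TI_sysenter_return - THREAD_SIZE + 4 * 4))

# Address the pushl is supposed to read from.
want=$((base + TI_sysenter_return))
echo "fixed offset hits field: $(( esp + disp_new == want ))"
echo "old offset misses by:    $(( want - (esp + disp_old) )) bytes"
```

With any consistent choice of values the old displacement lands 8 bytes short
of the field, which matches the +8 added by the one-liner.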
Comment 24 Jeremy Katz 2005-05-25 21:51:31 EDT
1363 works on my box that didn't work before.  Placing in MODIFIED.  If anyone
continues to have problems as of 1363FC4, please reopen.
Comment 25 gene c 2005-05-26 23:55:15 EDT
Confirmed fixed for me too using 1363_FC4smp.

Thanks!
Comment 26 Dave Jones 2005-05-27 21:24:35 EDT
*** Bug 158816 has been marked as a duplicate of this bug. ***
