Bug 158413
Summary: | (busted vdso) i686 SMP kernel stuck during boot, UP works | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Warren Togami <wtogami> |
Component: | kernel | Assignee: | Dave Jones <davej> |
Status: | CLOSED RAWHIDE | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | gjunk, jspaleta, pfrields, rbh00, roland, sbruno, tech-fedora-bugzilla, tsukahara.ken, wtogami |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2005-05-28 01:23:49 UTC | Type: | --- |
Bug Blocks: | 136450 | ||
Attachments: |
Description
Warren Togami
2005-05-22 03:42:33 UTC
x86_64 FC4 on the same hardware does not have this SMP kernel problem. What we don't know however is if plain i686 SMP hardware is affected by this problem, in that case we should fix this before FC4. If it is unaffected then this shouldn't be a blocker.

I was able to boot the kernel from http://people.redhat.com/wtogami/temp/kernel-smp-2.6.11-1.1267_FC4.i686.rpm on a dual Opteron system built on the ASUS K8N-DL with 246 model opterons. However the Broadcom NetXtreme Ethernet controller (tg3) seems to have an issue as I am unable to get on the network with this kernel. If I boot off of the uniprocessor kernel from FC4T3 or any rawhide update, the ethernet controller works just fine.

*** Bug 157691 has been marked as a duplicate of this bug. ***

Bug 157691 confirms that this is a general i686 SMP problem that affects both 32bit AMD64 and Pentium 4/Xeon. We should try to avoid releasing FC4 with this problem.

*** Bug 156664 has been marked as a duplicate of this bug. ***

Gene, in Bug 156664 #c3 you mention that the SMP kernel successfully boots on a dual Pentium4 without HT? Could you please attach a text file containing /proc/cpuinfo from that machine?

It would help if somebody with a serial console could do the following procedure:

1) Apply the below patch to /sbin/mkinitrd script.

--- mkinitrd.orig 2005-05-22 19:28:32.000000000 -1000
+++ mkinitrd 2005-05-22 19:29:22.000000000 -1000
@@ -749,6 +749,8 @@
  echo "echo Mounting root filesystem" >> $RCFILE
  echo "mount -o $rootopts --ro -t $rootfs $rootdev /sysroot" >> $RCFILE
+ echo "echo Enabling Magic SysRQ" >> $RCFILE
+ echo "echo echo 1 > /proc/sys/kernel/sysrq" >> $RCFILE
  echo "echo Switching to new root" >> $RCFILE
  if [ -n "$UDEV_KEEP_DEV" ]; then
  echo "switchroot --movedev /sysroot" >> $RCFILE

2) Create a new initrd image for the latest SMP kernel. Make a backup of the existing initrd just in case you somehow screw it up. Doing this would be something like:

mv /boot/initrd-2.6.11-1.XXXX_FC4smp.img /boot/initrd-2.6.11-1.XXXX_FC4smp.img.backup
/sbin/mkinitrd /boot/initrd-2.6.11-1.XXXX_FC4smp.img 2.6.11-1.XXXX_FC4smp

3) Reboot using that new initrd. When it gets stuck, hit ALT-SysRQ-T. Save the entire dump into a text file and attach it in this bug.

Oops... one too many echos.

--- mkinitrd.orig 2005-05-22 19:28:32.000000000 -1000
+++ mkinitrd 2005-05-22 19:37:04.000000000 -1000
@@ -749,6 +749,8 @@
  echo "echo Mounting root filesystem" >> $RCFILE
  echo "mount -o $rootopts --ro -t $rootfs $rootdev /sysroot" >> $RCFILE
+ echo "echo Enabling Magic SysRQ" >> $RCFILE
+ echo "echo 1 > /proc/sys/kernel/sysrq" >> $RCFILE
  echo "echo Switching to new root" >> $RCFILE
  if [ -n "$UDEV_KEEP_DEV" ]; then
  echo "switchroot --movedev /sysroot" >> $RCFILE

If your i686 SMP boots with the FC4 smp kernel, please submit your /proc/cpuinfo in an attachment.
If you lock up during boot, please attach alt-sysrq-T as indicated in Comment #8 and #9 and /proc/cpuinfo.

Created attachment 114745 [details]
contents of /proc/cpuinfo for i686 smp system using 1315 kernel
I have an smp i686 machine booting with 1315 rawhide smp kernel.
I'll try booting into 1340 as soon as i'm physically at the machine again.
uname -a
Linux local.localdomain 2.6.11-1.1315_FC4smp #1 SMP Mon May 16 17:14:20 EDT
2005 i686 athlon i386 GNU/Linux
uptime
16:47:30 up 2 days, 21:08, 3 users, load average: 0.04, 0.05, 0.07
attached is the output of /proc/cpuinfo
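
The backup-and-rebuild step of the serial-console procedure requested earlier in this bug could be sketched as a dry run that only prints the two commands, so they can be reviewed before anything in /boot is touched. This is a minimal sketch, not part of the original report: the function name `initrd_rebuild_cmds` is hypothetical, and the version string is the same `XXXX` placeholder used in the procedure, which has to be replaced with the real SMP kernel version.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the commands that back up the
# existing initrd and build a new one with the patched /sbin/mkinitrd.
# initrd_rebuild_cmds is a hypothetical helper name used for illustration.
initrd_rebuild_cmds() {
    kver="$1"                          # kernel version, e.g. 2.6.11-1.XXXX_FC4smp
    img="/boot/initrd-${kver}.img"     # initrd path as used in the procedure
    echo "mv ${img} ${img}.backup"     # keep the old image in case the new one is broken
    echo "/sbin/mkinitrd ${img} ${kver}"
}

# XXXX is a placeholder from the report, not a real build number.
initrd_rebuild_cmds "2.6.11-1.XXXX_FC4smp"
```

Once the printed commands look right, they can be run as root by hand. If the new image fails to boot, the `.backup` file can be moved back into place from a working kernel.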
Created attachment 114746 [details]
contents of /proc/cpuinfo for i686 smp system using 1315 kernel
I have an smp i686 machine booting with 1315 rawhide smp kernel.
I'll try booting into 1340 as soon as i'm physically at the machine again.
uname -a
Linux local.localdomain 2.6.11-1.1315_FC4smp #1 SMP Mon May 16 17:14:20 EDT
2005 i686 athlon i386 GNU/Linux
uptime
16:47:30 up 2 days, 21:08, 3 users, load average: 0.04, 0.05, 0.07
attached is the output of /proc/cpuinfo
Created attachment 114757 [details]
/proc/cpuinfo - 2.6.11-1.1319_FC4smp - machine boots fine
(In reply to comment #11)
> I have an smp i686 machine booting with 1315 rawhide smp kernel.
> I'll try booting into 1340 as soon as i'm physically at the machine again.

Sorry about the double comment earlier.
Booted the i686 smp machine into 1340 smp kernel. I have selinux in permissive mode, but from the other comments in this report so far I don't think that should matter.
-jef

Created attachment 114760 [details]
/proc/cpuinfo for P4 w/HT -- can't boot 1340
My i686 SMP (Dell GX280 with 1 P4 and HT turned on) hangs on booting with 1340
(and has since 1276). The last SMP kernel I successfully booted with was 1261
(but I haven't tried anything between 1261 and 1276). The UP kernels boot fine.
When the boot hangs (after the LVM message) I can't reboot with Ctrl-Alt-Del
(no serial console; USB keyboard is completely unresponsive, hitting caps
lock/num lock doesn't change keyboard lights). Upgraded to most recent Dell
BIOS (A05, from A04) with no change.
/proc/cpuinfo is attached.
Created attachment 114761 [details]
Picture of end of Alt-Sysrq-T when hung
Same system as my earlier report - HT single CPU - SATA disk.
Sorry, no serial console. I know it's not enough, but this is what was left on
screen when I did Alt-SysRq-T while it was hung.
gene/
My current theory is that it is failing to boot only on "newer" i686 SMP machines. We need to find a common theme here.

Can you folks try rebuilding upstream vanilla 2.6.12-rc4-gitX using the SMP config file from /boot/config-*? We need to know if it is an upstream problem, or something we added.

I built 2.6.12-rc4-git8 using the config-2.6.11-1.1340_FC4smp config from /boot. I had to comment out the IPMI options as they gave compile errors.
Sweet - this kernel boots with no problem at all.
Best regards, gene

Created attachment 114810 [details]
SysRQ Show State when it gets stuck
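
The vanilla rebuild requested above could be sketched as a dry run that prints the build steps. Only the config file name comes from this report. The helper name `vanilla_build_cmds`, the source directory, and the choice of make targets are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the steps for building an upstream
# kernel tree with the distro SMP config. vanilla_build_cmds is a
# hypothetical helper name; the source directory below is an assumed path.
vanilla_build_cmds() {
    cfg="$1"        # distro config, e.g. /boot/config-2.6.11-1.1340_FC4smp
    src_dir="$2"    # unpacked upstream kernel tree (assumed location)
    echo "cd ${src_dir}"
    echo "cp ${cfg} .config"
    echo "make oldconfig"                 # prompts for options added since the old config
    echo "make"
    echo "make modules_install install"   # run this step as root
}

vanilla_build_cmds /boot/config-2.6.11-1.1340_FC4smp /usr/src/linux-2.6.12-rc4-git8
```

As reported above, the IPMI options may need to be disabled in `.config` before the build completes on the later git snapshots.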
Hi again Warren. I built and tested successfully. Working backwards from git8, I found the same problem Gene did in git8, git7, git6, git5. git4 built okay. I booted git4 and verified with gkrellm that there appeared to be two CPUs. I'd also built with plain 2.6.12-rc4 so I tried booting that, and it too came up fine with two CPUs showing. Between Gene and me we've tested rc4, rc4-git4, and rc4-git8.

Bingo! I rebuilt the 1355 after commenting out patch 810 (exec-shield) and 813 (vdso), and the smp kernel successfully booted. Arjan suggested this may indicate a busted vdso, so I tried 1355smp with "vdso=0" and it too successfully booted.

Busted vdso? I was locking up (see Comment #15), but if I give vdso=0 to 1341smp (the latest kernel yum finds right now), it boots just fine and /proc/cpuinfo shows me two CPUs (which are really 1 P4 with HT on).

I committed the one-liner change to execshield.patch, which needed the update because of upstream changes. Dave's next build hopefully wins.

@@ -21,9 +21,9 @@ diff -urNp --exclude-from=/home/davej/.e
 + /*
 + * Push current_thread_info()->sysenter_return to the stack.
 + * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
-+ * pushed above, and the word being pushed now:
++ * pushed above; +8 corresponds to copy_thread's esp0 setting.
 + */
-+ pushl (TI_sysenter_return-THREAD_SIZE+4*4)(%esp)
++ pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
   /*
   * Load the potential sixth argument from user stack.
   * Careful about security.

1363 works on my box that didn't work before. Placing in MODIFIED. If anyone continues to have problems as of 1363FC4, please reopen.

Confirmed fixed for me too using 1363_FC4smp. Thanks!

*** Bug 158816 has been marked as a duplicate of this bug. ***
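
The offset arithmetic in the one-liner can be checked with a small worked example. This is a sketch under stated assumptions, not the kernel's own code. It assumes, following the patch comment, that copy_thread leaves esp0 8 bytes below the top of the kernel stack and that thread_info sits at the bottom of that stack. The stack size, the field offset, and the stack base address are hypothetical values chosen only to make the numbers concrete.

```shell
#!/bin/sh
# Illustrative values only; none of these numbers come from the bug report.
THREAD_SIZE=4096          # assumed kernel stack size in bytes
TI_SYSENTER_RETURN=24     # hypothetical offset of sysenter_return in thread_info
STACK_BASE=65536          # hypothetical address of thread_info (bottom of stack)

# Displacement from %esp to thread_info->sysenter_return, given how many
# bytes esp0 sits below the stack top (gap) and how many 4-byte words
# have been pushed since entry (pushed).
sysenter_return_disp() {
    gap="$1"; pushed="$2"
    echo $(( TI_SYSENTER_RETURN - THREAD_SIZE + gap + 4 * pushed ))
}

old_disp=$(sysenter_return_disp 0 4)   # formula before the fix
new_disp=$(sysenter_return_disp 8 4)   # formula after the fix

# Simulate %esp: start at esp0 (stack top minus 8), then push 4 words.
esp=$(( STACK_BASE + THREAD_SIZE - 8 - 4 * 4 ))
target=$(( STACK_BASE + TI_SYSENTER_RETURN ))

echo "old formula misses by $(( target - (esp + old_disp) )) bytes"
echo "new formula misses by $(( target - (esp + new_disp) )) bytes"
```

Under these assumptions the old displacement lands 8 bytes short of the field, so the push would pick up the wrong word as the sysenter return address, while the corrected displacement lands on it exactly. That is consistent with booting with vdso=0 avoiding the hang.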