Bug 175616

Summary: [RHEL 4 U2] kernel panic on EM64T with long cmdline args
Product: Red Hat Enterprise Linux 4 Reporter: Brian Long <brilong>
Component: kernel Assignee: Jim Paradis <jparadis>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: high Docs Contact:
Priority: medium    
Version: 4.0 CC: jbaron, k.georgiou, peterm, rick.beldin, tao
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2006-0575 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-08-10 21:44:45 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 181409, 185624    
Attachments (description, flags):
HP DL360 G4 during kickstart of RHEL 4 U2 (flags: none)
HP DL380 G4 GRUB boot with 2.6.9-22.0.1.EL UP (flags: none)
DL585 Quad Opteron panic (flags: none)

Description Brian Long 2005-12-13 13:45:51 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050921 Red Hat/1.0.7-1.4.1 Firefox/1.0.7

Description of problem:
I am kickstarting from the extracted RHEL 4 U2 ISO images although this problem happens outside of kickstart as well.  The kernel normally allows for up to 255 characters to be specified as boot parameters.  In kickstart, these boot parameters are passed to loader and Anaconda.  On any system except the HP DL360 G4 and DL380 G4 (both dual Xeon EM64T), the following kernel command-line is acceptable (from syslinux.cfg or pxelinux.cfg):

append initrd=pxe-tftp.esl.cisco.com::images/releases/5.00b4-4/x86_64/initrd.img ks=http://wwwin-kickstart-dev/cgi-bin/pxe/getkscfg?macaddr=00:12:79:3d:f7:d8 ramdisk_size=8192 ksdevice=00:12:79:3d:f7:d8  nostorage

As you can see, the kernel append line is 207 characters.  The x86_64 kickstart environment crashes the kernel with this append line.  The i686 kickstart works fine.  If I shorten the append line to 176 characters or less, the kickstart on x86_64 completes successfully.

This appears to be an x86_64 kernel issue only on the EM64T platform running the UP kernel.  If I pass a long command line to the SMP x86_64 kernel, the host boots without problems.

Please note, I can reproduce this on a running system as well.  If I configure GRUB to boot the 2.6.9-22.EL UP or 2.6.9-22.0.1.EL UP kernel and specify a large command-line, it will fail.  For example, if I specify:

ro root=LABEL=/ quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet console=tty

The kernel will crash.  If I shorten it to 176 characters or fewer, the kernel will boot properly.

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.EL

How reproducible:
Always

Steps to Reproduce:
1. Boot UP kernel on EM64T host
2. Specify command line 177 characters or longer
3. Kernel oops
  

Actual Results:  Kernel oops (screenshot attached)

Expected Results:  Kernel should boot fine.

Additional info:

This also happens in the 2.6.9-22.0.1.EL errata kernel.

Comment 1 Brian Long 2005-12-13 13:46:50 UTC
Created attachment 122178
HP DL360 G4 during kickstart of RHEL 4 U2

Comment 2 Brian Long 2005-12-13 13:47:42 UTC
Created attachment 122179
HP DL380 G4 GRUB boot with 2.6.9-22.0.1.EL UP

Comment 3 Brian Long 2005-12-13 13:59:51 UTC
This may be related to
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=164643 which was never
resolved from U1.

Comment 4 Brian Long 2005-12-13 16:58:07 UTC
Created attachment 122183
DL585 Quad Opteron panic

As it turns out, I am able to duplicate this problem on another HP system, the
DL585 quad Opteron server, when trying to kickstart using the x86_64 UP kernel.
However, other Opterons (Sun V20z and V40z) do not have this problem.

However, unlike the EM64T systems, once the DL585 has the x86_64 OS installed,
I am able to boot the UP kernel with a 255-character command-line and it does
not crash.

Comment 5 Jim Paradis 2005-12-16 04:01:22 UTC
I don't think it's related to Bug 164643; that one is a NULL pointer
dereference.  This one is different.

Disassembling the kernel and looking at the faulting address, we see that
start_kernel() is trying to call late_time_init().  late_time_init is in fact
a pointer to a routine, so it's stored in kernel data space.  Immediately
preceding it is "saved_command_line".  For some reason, although the main
command line buffer has been boosted to 2048 on x86_64, saved_command_line is
still at 256.  If you look carefully, the RAX register (which is supposed to
contain the "late_time_init" routine pointer or NULL) seems to contain ASCII data.

Should be a straightforward fix.


Comment 6 Andy Gospodarek 2005-12-16 12:30:25 UTC
I saw the same thing and actually have it in the Issue Tracker.  Sorry I didn't
update the BZ as well.  

It seems we are crashing in "init/main.c" line 544 (address 0xffffffff80539668)
because 'late_time_init' is set to a bogus, non-NULL value (0x656c6f736e6f6320)
held in register %rax.  Here is the disassembled vmlinux:

ffffffff80539649:       00 00 00 00
ffffffff8053964d:       e8 be 16 01 00          callq  ffffffff8054ad10 <vfs_caches_init_early>
ffffffff80539652:       e8 87 e9 00 00          callq  ffffffff80547fde <mem_init>
ffffffff80539657:       e8 08 0d 01 00          callq  ffffffff8054a364 <kmem_cache_init>
ffffffff8053965c:       48 8b 05 bd d1 f6 ff    mov    -601667(%rip),%rax        # ffffffff804a6820 <late_time_init>
ffffffff80539663:       48 85 c0                test   %rax,%rax
ffffffff80539666:       74 02                   je     ffffffff8053966a <start_kernel+0x1ef>
ffffffff80539668:       ff d0                   callq  *%eax
ffffffff8053966a:       e8 91 29 bd ff          callq  ffffffff8010c000 <calibrate_delay>
ffffffff8053966f:       e8 d7 f8 00 00          callq  ffffffff80548f4b <pidmap_init>
ffffffff80539674:       e8 b5 0c 01 00          callq  ffffffff8054a32e <prio_tree_init>
ffffffff80539679:       e8 09 10 01 00          callq  ffffffff8054a687 <anon_vma_init>
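
For anyone checking the address arithmetic in the listing above, here is a minimal sketch of how the RIP-relative operand of the mov at ffffffff8053965c resolves. The helper name rip_relative_target is made up for illustration; the only inputs are the displacement (-601667) and the address of the following instruction (ffffffff80539663), both taken from the disassembly.

```c
#include <stdint.h>

/* Hypothetical helper, not kernel code.
 * A RIP-relative operand is resolved against the address of the NEXT
 * instruction, so the displacement is sign-extended and added to it.
 * Unsigned 64-bit arithmetic wraps, which gives the expected result
 * for negative displacements. */
uint64_t rip_relative_target(uint64_t next_insn_addr, int32_t disp)
{
    return next_insn_addr + (uint64_t)(int64_t)disp;
}
```

Called with 0xffffffff80539663 and -601667, this should give 0xffffffff804a6820, the address objdump annotates as late_time_init.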

Here is the source that matches it from "init/main.c"
      vfs_caches_init_early();
      mem_init();
      kmem_cache_init();
      numa_policy_init();
      if (late_time_init)
              late_time_init();
      calibrate_delay();
      pidmap_init();
      pgtable_cache_init();
      prio_tree_init();
      anon_vma_init();

When I dumped all sections of the kernel, I found that 'late_time_init' sits
right after 'saved_command_line':

ffffffff804a6700 g     O .bss   0000000000000004 system_state
ffffffff804a6720 g     O .bss   0000000000000100 saved_command_line
ffffffff804a6820 g     O .bss   0000000000000008 late_time_init
ffffffff804a6828 l     O .bss   0000000000000008 execute_command

The data in late_time_init looks like the continuation of the ASCII string from
saved_command_line, so we should consider lengthening that buffer, or find some
other way to prevent more than 256 bytes from being copied into it.
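
To illustrate the failure mode described above, here is a minimal user-space sketch, assuming only the layout from the symbol table (a 0x100-byte saved_command_line immediately followed by the 8-byte late_time_init). It is not the kernel's code and not the actual patch; struct bss_model, copy_unbounded, copy_bounded and pointer_slot are hypothetical names. The two objects are modelled as one byte array so that writing past the first 256 bytes stays well-defined C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAVED_CMDLINE_SIZE 0x100 /* size of saved_command_line in the symbol table */
#define PTR_SLOT_SIZE      8     /* size of late_time_init in the symbol table */

/* Hypothetical model, not kernel code: one byte array standing in for the
 * two adjacent .bss objects. Bytes [0, 256) play saved_command_line and
 * bytes [256, 264) play the late_time_init pointer. */
struct bss_model {
    unsigned char bytes[SAVED_CMDLINE_SIZE + PTR_SLOT_SIZE];
};

/* Copy the whole string, terminator included, ignoring the 256-byte limit.
 * The cap at the model size exists only to keep this demo in bounds. */
void copy_unbounded(struct bss_model *m, const char *src)
{
    size_t n = strlen(src) + 1;
    if (n > sizeof m->bytes)
        n = sizeof m->bytes;
    memcpy(m->bytes, src, n);
}

/* Copy at most 255 characters and always NUL-terminate inside the buffer. */
void copy_bounded(struct bss_model *m, const char *src)
{
    size_t n = strlen(src);
    if (n > SAVED_CMDLINE_SIZE - 1)
        n = SAVED_CMDLINE_SIZE - 1;
    memcpy(m->bytes, src, n);
    m->bytes[n] = '\0';
}

/* Read the 8 bytes that stand in for the late_time_init pointer. */
uint64_t pointer_slot(const struct bss_model *m)
{
    uint64_t v;
    memcpy(&v, m->bytes + SAVED_CMDLINE_SIZE, sizeof v);
    return v;
}
```

With a zeroed model and a source string longer than 255 characters, pointer_slot() comes back non-zero after copy_unbounded() and stays zero after copy_bounded(). A non-zero slot is what makes the "if (late_time_init)" test in start_kernel() pass and jump to an address made of ASCII text.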

Comment 7 Brian Long 2005-12-16 14:31:41 UTC
So I understand the U3 kernel is frozen with regards to features, but is this
bug bad enough to warrant being fixed in the U3 candidate (i.e. U3 beta)?  Thanks!

Comment 8 Rick Beldin 2005-12-16 15:46:14 UTC
The value 0x656c6f736e6f6320 turns out to be the ASCII for ' console' (with a
leading space), stored backwards because x86_64 is little-endian.

If this is from the command line above

ro root=LABEL=/ quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet
quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet
quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet quiet
quiet quiet console=tty

That line is 256 chars long, which means that 'console' is overwriting at about
245... Odd...
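
The decoding of the register value can be reproduced with a short sketch. The function name decode_le64 is made up for illustration; it writes the eight bytes of a 64-bit value in the order they would sit in memory on a little-endian machine.

```c
#include <stdint.h>

/* Hypothetical helper, not kernel code.
 * Writes the 8 bytes of `value` into out[0..7] in little-endian memory
 * order (least significant byte first) and NUL-terminates at out[8]. */
void decode_le64(uint64_t value, char out[9])
{
    for (int i = 0; i < 8; i++)
        out[i] = (char)((value >> (8 * i)) & 0xff);
    out[8] = '\0';
}
```

For 0x656c6f736e6f6320 the bytes come out as 20 63 6f 6e 73 6f 6c 65, which is the string " console" including the leading space.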

Comment 9 Brian Long 2005-12-16 16:12:10 UTC
Remove a few of the quiet words and the kernel will still oops.  It will oops
all the way down to 176 characters.

Comment 18 Jim Paradis 2006-03-16 21:42:00 UTC
The patch in Comment 15 was posted to rhkernel-list on 16 Dec 2005 and is a
candidate for inclusion in U4.


Comment 20 Bob Johnson 2006-04-11 16:41:58 UTC
This issue is on Red Hat Engineering's list of planned work items 
for the upcoming Red Hat Enterprise Linux 4.4 release.  Engineering 
resources have been assigned and barring unforeseen circumstances, Red 
Hat intends to include this item in the 4.4 release.

Comment 22 Jason Baron 2006-05-03 17:02:07 UTC
committed in stream U4 build 34.28. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/


Comment 25 Red Hat Bugzilla 2006-08-10 21:44:46 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html


Comment 27 Linda Wang 2006-12-06 18:31:42 UTC
*** Bug 185524 has been marked as a duplicate of this bug. ***