Bug 181721 - Dell PowerEdge servers w/ large disk+memory cause segfaults during rescue mode
Summary: Dell PowerEdge servers w/ large disk+memory cause segfaults during rescue mode
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: anaconda
Version: 4.0
Hardware: ia32e
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Peter Jones
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 176344 198694 200936 246028
 
Reported: 2006-02-15 23:59 UTC by Shawn Starr
Modified: 2007-12-07 22:14 UTC
CC List: 6 users

Fixed In Version: RHEL4.6
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-12-07 22:14:23 UTC
Target Upstream Version:
Embargoed:


Attachments
Core files (for init and loader) (373.04 KB, application/x-gzip)
2006-03-01 20:42 UTC, Will Nguyen

Description Shawn Starr 2006-02-15 23:59:53 UTC
Description of problem:
-----------------------
When PXE booting an EM64T host with large memory (2GB+) and a large RAID disk
(160GB+) in rescue mode, utilities under /sbin segfault.  In our case,
/sbin/init and /sbin/halt segfault.

Version-Release number of selected component (if applicable):
RHEL4 U1 (Anaconda version = 10.1.1.19, kernel = 2.6.9-11.EL)


How reproducible:
-----------------
Always reproducible


Hardware configuration:
-----------------------
Dell PowerEdge servers with 2GB (or more) of RAM, and RAID-ed drives having
160GB (or more) of space.

   One example of a configuration where we encountered segfaults:
     PE1850 (EM64T) with 4 GB of RAM 
     2 x 300 GB Hard drives (RAID 1)
     onboard RAID controller (ROMB) 


Steps to Reproduce:
-------------------

A) Set up the network server for PXE installations

1. Add a DHCP entry for the client server on the install server
2. Add the vmlinuz + initrd.img from RHEL4 U1 CD to /tftpboot/pxelinux
3. Add the netstg2.img + product.img from RHEL4 U1 CD to
/var/www/html/abc/RedHat/base
4. In /tftpboot/pxelinux/pxelinux.cfg/default, change the boot params to:
   append initrd=initrd.img ramdisk_size=150000 devfs=nomount headless selinux=0
linux text rescue method=http://10.1.111.1/abc
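
For reference, a minimal sketch of the complete /tftpboot/pxelinux/pxelinux.cfg/default
implied by the steps above. The append line is taken verbatim from step 4; the
default/prompt/timeout directives and the label name "rescue" are assumptions, not part
of the original report:

    default rescue
    prompt 0
    timeout 50
    label rescue
      kernel vmlinuz
      append initrd=initrd.img ramdisk_size=150000 devfs=nomount headless selinux=0 linux text rescue method=http://10.1.111.1/abc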


B) PXE Booting the client host:
1. Make the client boot over PXE
2. Click OK for the Language/Keyboard screen
3. Use DHCP to obtain an IP address
4. At the rescue mode screen, do not mount any drives; just select "Skip"
5. Download a test script from the network server (an illustrative sketch of this
script follows the note below). This test script will:
- use sfdisk to clear the partitions
- dd the disk
- use sfdisk to create a 10GB root partition, a 1GB swap partition, and an /export
partition occupying the rest of the disk
- format the partitions (sda1/sda3 as ext3, sda2 as swap)
- mount /dev/sda3 on a temporary directory, then download all of the RHEL4 U1
rpms from the network server (two times)
6. After the script finishes, run /sbin/init and /sbin/halt and check whether they
segfault.

NOTE: the test scripts are available upon request
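
Since the actual test scripts are only available upon request, here is an
illustrative shell approximation of the script described in step 5. The partition
sizes and filesystem choices follow the description above, but the sfdisk input,
the use of dd and wget, and the names DISK, SERVER, and /mnt/export are
assumptions:

    # Illustrative approximation of the test script -- not the original.
    DISK=/dev/sda
    SERVER=http://10.1.111.1/abc        # install server from the PXE config

    # Clear the existing partition table / zero the start of the disk
    dd if=/dev/zero of=$DISK bs=512 count=2048

    # 10GB root (sda1), 1GB swap (sda2), /export (sda3) on the remainder
    printf ',10240,L\n,1024,S\n,,L\n' | sfdisk -uM $DISK

    # Format the partitions (sda1/sda3 as ext3, sda2 as swap)
    mkfs.ext3 ${DISK}1
    mkswap    ${DISK}2
    mkfs.ext3 ${DISK}3

    # Mount the /export partition and download the RHEL4 U1 rpms, twice
    mkdir -p /mnt/export
    mount ${DISK}3 /mnt/export
    for pass in 1 2; do
        wget -r -np -nd -P /mnt/export/rpms $SERVER/RedHat/RPMS/
    done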


Actual results:
---------------
After running our test script to re-partition the drive and download the rpm
packages, we attempted to run /sbin/init and /sbin/halt, with the following results:

# /sbin/init
Segmentation fault
# /sbin/halt
Segmentation fault

init[3352]: segfault at 0000000000000000 rip 0000000000000000 rsp
0000007fbffff918 error 14
halt[3646]: segfault at 0000000000000000 rip 0000000000000000 rsp
0000007fbffff918 error 14

NOTE: additional anaconda/dmesg log files are available upon request


Expected results:
-----------------
The /sbin/init and /sbin/halt binaries should not segfault.  


Additional info:
----------------
We were able to run the exact same test using Fedora Core 5 Test 3 (kernel =
2.6.15-1.1948_FC5), and could not reproduce the problem (i.e. halt/init did not
segfault).

Comment 1 Shawn Starr 2006-02-16 18:00:57 UTC
Additional results:
-------------------

I was able to reproduce this with the RHEL4 U3 Beta AS EM64T install images
(vmlinuz/initrd.img/netstg2.img/product.img) by running the same test as
described above.

I found that all of the binaries under /sbin segfaulted, including:
  halt, init, loader, modprobe, insmod, shutdown



Comment 2 Will Nguyen 2006-02-17 23:22:13 UTC
Here are some other configurations we used to reproduce the segfault issue on
the Dell PowerEdge 1850.

1) RHEL4 AS U1 (2.6.9-11), RH initrd and netstg2, 2x300GB SCSI disks (no RAID),
mem=2G => init/halt segfaulted
 
2) RHEL AS U3 Beta (2.6.9-27), RH initrd and netstg2, 2x300GB SCSI disks (no
RAID), mem=2G => init, halt, modprobe,insmod, loader segfaulted

Note that in each case, the first disk (sda) was partitioned using the layout that
our test script uses (10GB root, 1GB swap, export partition taking up the rest of
the space), and one big partition was created on the second disk (sdb).



Comment 3 Samuel Benjamin 2006-02-28 16:50:01 UTC
Comments from support engineering suggest that the process is having trouble
loading in rescue mode (perhaps an appropriate library can't be found).  The
binaries from /sbin have this problem as many of them are dynamically linked.
Since we don't have immediate access to an appropriate system, we are requesting
some debug information on this problem.

Please provide the following information from a failing system:
1) A core dump of the segfaulting applications. Core dumps
can be enabled by editing /etc/profile as follows:

    Comment out this line:
    ulimit -S -c 0 > /dev/null 2>&1

    Add this line:
    ulimit -S -c unlimited > /dev/null 2>&1

    Re-login to allow this change to take effect.
    Run the failing command.
    (More information on setting up core dumps is in the KB article at:
          http://kbase.redhat.com/faq/FAQ_39_2890.shtm )

* Provide file core.<pid> generated from this run.

2) An strace of the segfaulting applications

    strace -o strace.out <command> 

* Provide strace.out file.

3) A copy of the ld.debug.txt file produced when the application loads after setting
 LD_DEBUG and LD_DEBUG_OUTPUT.

    export LD_DEBUG=all LD_DEBUG_OUTPUT=./ld.debug.txt   
    Run failing command.
    unset LD_DEBUG LD_DEBUG_OUTPUT 

* Provide file ld.debug.txt.<pid>
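
For convenience, a minimal sketch of collecting all three items in a single
rescue-shell session, assuming the failing binary is /sbin/init and that core
creation can be enabled with ulimit directly in the current shell (see the
following comments):

    ulimit -S -c unlimited                  # 1) allow core files in this shell
    /sbin/init                              #    reproduce the crash -> core.<pid>
    strace -o strace.out /sbin/init         # 2) system-call trace
    export LD_DEBUG=all LD_DEBUG_OUTPUT=./ld.debug.txt
    /sbin/init                              # 3) loader trace -> ld.debug.txt.<pid>
    unset LD_DEBUG LD_DEBUG_OUTPUT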


Comment 4 Shawn Starr 2006-02-28 17:03:10 UTC
This assumes the machine has an OS installed, which it doesn't. Can ulimit be set
from inside another shell, since we can't log out of a rescue mode shell?

Everything else we can do from the rescue mode.

Comment 5 Samuel Benjamin 2006-03-01 15:47:58 UTC
Just running "ulimit -S -c unlimited" in the current shell seems to enable core
creation. Please try it and let me know if this works. Thanks.

Comment 6 Will Nguyen 2006-03-01 20:41:21 UTC
We reproduced the problem and obtained the following debug info:

1) We were able to generate core files (for init, loader)
2) No strace output could be obtained.  The binary segfaulted right away before
we could get any output.
3) No LD debug info could be obtained. The binary segfaulted right away before
we could get any output.

Here are the stack traces from the core files for /bin/init and /bin/loader. We
noticed the instruction pointer was null in both core files:

Core was generated by `/bin/init'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
(gdb) #0  0x0000000000000000 in ?? ()
#1  0x0000000000407acf in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) q
(gdb) 

Core was generated by `/bin/loader'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
(gdb) #0  0x0000000000000000 in ?? ()
#1  0x000000000047356f in ?? ()
#2  0x0000000000000000 in ?? ()

Also attached are the generated core files.
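
For reference, backtraces like the ones above can be extracted from the attached
core files along these lines (the core file name is a placeholder):

    gdb /bin/init core.<pid>        # then run "bt" at the (gdb) prompt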


Comment 7 Will Nguyen 2006-03-01 20:42:42 UTC
Created attachment 125492 [details]
Core files (for init and loader)

Comment 9 Peter Jones 2006-03-17 19:39:30 UTC
Clarify for me -- the bug says ia32e, but you say "an EM64T host" -- is this an
i386 installation onto an em64t box?

Comment 10 Shawn Starr 2006-03-17 19:48:30 UTC
This is a 64-bit installation on an EM64T box. We did not try this with a
32-bit installation.

Comment 12 Samuel Benjamin 2006-03-27 20:05:33 UTC
As indicated in the steps to reproduce this problem, engineering would like to
understand what is being attempted by manually erasing and formatting the drive
from rescue mode and then running the /sbin commands (which are segfaulting).

Why is kickstart not being used to create the partitions and perform this
installation?

Comment 13 Will Nguyen 2006-03-29 00:06:40 UTC
This segfault issue was originally reproduced using the Rocks distribution, a
toolkit for building clusters (see http://www.rocksclusters.org).  The version
of Rocks we are testing (Rocks 4.0) is based on the RHEL 4 U1 distribution, using
Anaconda version 10.1.1.13 and kernel 2.6.9-11.  When we tried to install Rocks
4.0 on a Dell PowerEdge 1850 with 2GB of memory and 2 x 300GB RAID disks, the
installation would fail because one of the running processes would segfault
during or at the start of package installation.  When we tried to type a command
on the 2nd virtual console (Alt-F2), the command would segfault immediately, and
"memory protection faults" would appear on the 4th virtual console (Alt-F4).  We
were able to work around the segfault by telling the kernel to use less memory
(mem=496M) when booting.  This issue occurred for both CD-based installs and
network-based installs using Kickstart.

We tried using the RHEL4 U2 and U3 versions of Anaconda and the kernel, but were
still able to reproduce the segfaults.  After removing all of the Rocks-specific
components from the distribution, we came up with a somewhat complicated method
for reproducing the problem using only RHEL components on the same hardware.
This led us to suspect that the issue is Anaconda-related or RHEL-related, as
opposed to being caused by one of the Rocks components.

The issue is much easier to reproduce when using the Rocks 4.0 distribution
than with the vanilla RHEL4 distribution.



Comment 17 RHEL Program Management 2006-08-18 16:38:51 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Peter Jones 2006-08-21 18:11:10 UTC
This really sounds like either something in your setup script is causing memory
corruption, or you have bad RAM or other hardware.  Can you reproduce this
without the test script, using only kickstart to perform the partitioning? 
Also, has anything (such as memtest86) been done to try and see if the hardware
is at fault?
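
For reference, a kickstart partitioning stanza matching the test script's layout
(10GB root, 1GB swap, /export on the remainder of sda) would look roughly like the
following. This is only a sketch of the kickstart-only reproduction being asked
about, not a configuration anyone has reported testing:

    clearpart --all --initlabel --drives=sda
    part /       --fstype ext3 --size=10240 --ondisk=sda
    part swap    --size=1024 --ondisk=sda
    part /export --fstype ext3 --size=1 --grow --ondisk=sda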

Comment 19 Will Nguyen 2006-08-28 20:04:54 UTC
This issue cannot be reproduced by doing a regular install on the Dell PE1850
using large disk/memory with just RHEL4 U1 and Kickstart.

There are two ways we can reproduce it:
1) Using RHEL4 U1 + the script (see comment #13 for details)
2) Using Rocks 4.0 (also see comment #13 for details)

The RAM we used in our test machines was not bad, since we saw this issue happen
on several machines.  The issue can be worked around by using the 'mem=512M'
kernel boot parameter.
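
For illustration, applying the mem=512M workaround to the PXE boot parameters from
the original report amounts to adding it to the append line; everything except the
added mem=512M is unchanged from the steps to reproduce:

    append initrd=initrd.img ramdisk_size=150000 devfs=nomount headless selinux=0 linux text rescue mem=512M method=http://10.1.111.1/abc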


Comment 23 RHEL Program Management 2006-10-27 19:01:29 UTC
Development Management has reviewed and declined this request.  You may appeal
this decision by reopening this request. 

Comment 24 Samuel Benjamin 2006-11-29 21:54:16 UTC
Justification from Dell to pursue a fix for 4.5 :

Event posted 11-09-2006 04:06am by Charles_Rose 	
This issue affects Dell's HPCC offering. We would like to know if Red Hat needs
additional info to help root cause this issue. Is there anything Dell/Platform
Computing can help with to resolve this issue?

We definitely cannot have this issue closed without a resolution. 

Comment 25 Peter Jones 2007-01-30 16:21:26 UTC
Why are you setting "ramdisk_size=150000" on the kernel command line?  This
shouldn't be there at all.

Comment 26 Larry Troan 2007-02-20 12:46:32 UTC
Ping to Shawn Starr (sstarr)......
> Comment #25 From Peter Jones (pjones) on 2007-01-30 11:21 EST 
>
> Why are you setting "ramdisk_size=150000" on the kernel command line?  This
> shouldn't be there at all.

It's been 21 days since the NEEDINFO. If this bug is to be diagnosed and fixed
for RHEL4.5, we need to move on it in a more expeditious manner.

Comment 28 Amit Bhutani 2007-02-22 22:16:27 UTC
I have been told that Shawn is OOO and hence unable to respond.

I am not even sure there is any value in root-causing and fixing this issue
anymore. I will let Shawn comment when he gets back, but I think it would be a
fair statement to say "Please don't hold up RHEL 4.5 for this," since this is not
a critical issue.

Comment 29 Shawn Starr 2007-02-26 16:46:11 UTC
No, this issue should NOT hold up RHEL 4.5. We do have a workaround. As for the
ramdisk_size, when I have some time I will try bumping it up to a bigger value or
just removing it altogether.
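
For illustration, dropping the parameter as suggested in comment #25 would leave an
append line along these lines (a sketch of the untested variant being discussed, not
a verified configuration):

    append initrd=initrd.img devfs=nomount headless selinux=0 linux text rescue method=http://10.1.111.1/abc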

Comment 31 RHEL Program Management 2007-03-10 01:09:45 UTC
This bugzilla had previously been approved for engineering
consideration but Red Hat Product Management is currently reevaluating
this issue for inclusion in RHEL4.6.

Comment 33 RHEL Program Management 2007-05-09 10:50:52 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 34 John Feeney 2007-07-17 17:58:07 UTC
RHEL4.6 development is quickly coming to an end and this bugzilla is still
unresolved. 

Shawn: Has this issue been resolved by setting the ramdisk size to something
reasonable? Amit suggested that this could be closed (comment #28), and I was
wondering if that could be done now that it has languished for several months.

Comment 35 Shawn Starr 2007-07-17 18:01:47 UTC
Hello John, we're about to test Anaconda 10.1.1.36 to see whether this issue still
occurs. I will update this bug once I have the required hardware again to test
this.

Comment 36 Shawn Starr 2007-08-09 16:26:51 UTC
I expect to begin testing this issue within 1 to 2 weeks from now.

Comment 39 RHEL Program Management 2007-09-07 19:45:49 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 40 John Feeney 2007-09-07 20:12:29 UTC
I believe we were waiting for test results so re-setting needinfo. 

Comment 41 Shawn Starr 2007-11-29 18:30:58 UTC
We can't reproduce the issue anymore, so please close this bug.

Comment 42 David Lehman 2007-12-07 22:07:31 UTC
Customer can no longer reproduce this so there is no point in spending further
resources on it.

