Bug 150815 - RHEL4 on Dell 1400SC has random lockups
Summary: RHEL4 on Dell 1400SC has random lockups
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jason Baron
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2005-03-10 21:37 UTC by Dennis Pinckard
Modified: 2013-03-06 05:58 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 13:28:23 UTC
Target Upstream Version:
Embargoed:


Attachments
lspci and /var/log/messages (23.17 KB, text/plain)
2005-03-10 21:39 UTC, Dennis Pinckard

Description Dennis Pinckard 2005-03-10 21:37:08 UTC
Description of problem:
I have recently built a new mail server (Postfix & Cyrus-Imap) which
authenticates users from my Windows 2003 AD.  Prior to installation, I
ran the full suite of memory tests from the install cd with no errors.

During testing, the system had 2-3 NMI errors and locked up.  These
seemed to happen most often when accessing the CD/DVD drive.  Adding
"acpi=off" to the kernel boot options seemed to fix that problem.

I migrated our old, failing mail system to this machine and put it
into production.  This morning, on its 4th day of operation, I found
it locked up and had to do a hard reset.  No video, no keyboard LEDs,
nothing.  /var/log/messages had no indication of the problem, just an
IMAP login (or logoff), then the kernel bootup messages from my reboot.

This afternoon, while cleaning up the backup routine, I tried to
rsync files from one directory to another and the system locked up
again.  This time, the caps-lock and scroll-lock LEDs were blinking.

In addition to "acpi=off", I have added "nmi_watchdog=1" to my kernel
boot options even though I no longer get the NMI error.  I think that
may be what allowed the keyboard LEDs to blink this time, however.
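
To confirm that both options actually reached the running kernel, the
kernel command line can be inspected.  The following is only a minimal
sketch: has_boot_option is a hypothetical helper name, and it assumes
/proc/cmdline is readable on the machine.

```shell
#!/bin/bash
# Sketch only: has_boot_option is a hypothetical helper, not a RHEL tool.
# It succeeds if the given option appears as a whole word on a kernel
# command line.  With no second argument it reads /proc/cmdline.
has_boot_option() {
    local option=$1
    local cmdline=${2:-$(cat /proc/cmdline 2>/dev/null)}
    local words word
    # Split on whitespace without triggering filename globbing.
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        [ "$word" = "$option" ] && return 0
    done
    return 1
}

# Report on the two options mentioned in this report.
for opt in acpi=off nmi_watchdog=1; do
    if has_boot_option "$opt"; then
        echo "$opt is on the kernel command line"
    else
        echo "$opt is NOT on the kernel command line"
    fi
done
```

Matching whole words avoids a false positive on a longer option that
merely starts with the same text.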

This is a dual PIII 1 GHz with 2 GB RAM (Dell/Crucial), a Dell PERC,
one unused e100 interface and an add-in e1000 card.

It's a basic server install with Postfix, Cyrus-imapd and SquirrelMail.
I have added Trend Micro's InterScan VirusWall for Unix and applied
all available updates.  Also installed are awstats and keychain.

I've applied all RHN updates EXCEPT for krb5-libs and
krb5-workstation.  kernel-smp-2.6.9-5.0.3.EL was applied but not
booted until after the first lockup today.  It's now the default and
running kernel.

SELinux was disabled after the first lockup.  It had been running in
permissive mode, but winbind generated a lot of warnings.

Most non-essential services have been disabled (PCMCIA, ISDN, etc.).

The same system ran with no problems under Windows 2000 and Red Hat
Linux 9.




Comment 1 Dennis Pinckard 2005-03-10 21:39:06 UTC
Created attachment 111873
lspci and /var/log/messages

Comment 2 Dennis Pinckard 2005-03-11 13:38:54 UTC
After a bit more experimenting (actually, just trying to back up the
machine), I feel I can reproduce the problem fairly reliably now.

I have a simple shell script that stops Cyrus-Imapd and Postfix,
rsyncs the mail data and config directories to a backup directory,
restarts the services, then creates an ISO image from the backup
directory to be burned onto DVD.

Running this script locks up the system during the rsync phase.
Sometimes it locks up immediately, before any files are rsynced;
other times it gets a small way into the transfer.

As the machine locked up when I ran the script this morning, I decided
to try again in runlevel 1 with both the e1000 and e100 drivers
unloaded.  Running the script in this environment worked, although I
stopped it during the mkisofs phase to get e-mail back online for the
users.

Searching on-line does reveal a history of lockups related to the
e1000 driver.

mail_backup.sh:
#!/bin/bash

# postfix rc.d script also starts/stops TrendMicro Interscan Viruswall
#
# Added verbose and progress options to rsync commands to trace when
# lockups occur.
#

BACKUPDIR=/backup/mail

partition=`grep "^partition-default:" /etc/imapd.conf  | cut -f2 -d" "`
config=`grep "^configdirectory:" /etc/imapd.conf  | cut -f2 -d" "`

service postfix stop
service cyrus-imapd stop

su - cyrus -c "/usr/lib/cyrus-imapd/ctl_mboxlist -d" > /backup/mail/mailboxlist.txt

rsync -avR --progress --delete $partition $BACKUPDIR
rsync -avR --progress --delete $config  $BACKUPDIR
rsync -avR --progress --delete /etc/imapd.conf $BACKUPDIR
rsync -avR --progress --delete /etc/cyrus.conf $BACKUPDIR
rsync -avR --progress --delete /etc/postfix $BACKUPDIR
rsync -avR --progress --delete /var/spool/postfix $BACKUPDIR

service cyrus-imapd start
service postfix start

mkisofs -R -J -o /backup/mail/mail-backup.iso /backup/mail/*
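
Because a hard lockup leaves nothing in /var/log/messages, one way to
narrow down which step was running is to record each step in a small
log and flush it to disk before continuing.  This is a sketch, not part
of the original script: log_step and the STEP_LOG path are assumptions
chosen for illustration.

```shell
#!/bin/bash
# Sketch only: log_step and STEP_LOG are hypothetical names.
# The default path is an assumption; override STEP_LOG to change it.
STEP_LOG=${STEP_LOG:-/var/tmp/mail_backup.steps}

# Append a timestamped line naming the step about to run, then call
# sync so the line is more likely to survive a hard reset.
log_step() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$STEP_LOG"
    sync
}

# Intended use inside mail_backup.sh, before each command:
#   log_step "rsync of $partition"
#   rsync -avR --progress --delete $partition $BACKUPDIR
```

After a reset, the last line of the log names the step that was in
progress when the machine stopped responding.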


Comment 3 Dennis Pinckard 2005-03-15 14:05:52 UTC
I have switched from the e1000 NIC to the e100, rebooted and unloaded
the e1000 module (for some reason it loaded after reboot anyway), then
attempted a backup.  This time the backup worked without a problem.

Based on this, I'd say there is a problem with the e1000 module that
causes certain configurations to lock up.
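
Since the e1000 module loaded again after the reboot, it is worth
verifying that it is really gone before each test run.  A minimal
sketch, assuming /proc/modules is available; module_loaded is a
hypothetical helper name:

```shell
#!/bin/bash
# Sketch only: module_loaded is a hypothetical helper.
# /proc/modules lists one loaded module per line, with the module name
# in the first field.  An alternate file can be passed for testing.
module_loaded() {
    local name=$1
    local modules_file=${2:-/proc/modules}
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Intended use before starting the backup (run as root):
#   if module_loaded e1000; then
#       modprobe -r e1000
#   fi
```

Comparing the whole first field keeps a check for e100 from matching
e1000, and the other way around.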

Please let me know if more information is needed or additional testing
is required.

Comment 4 Suzanne Hillman 2005-03-15 18:48:50 UTC
I'm not sure if this is a request for support, or simply an attempt to let us
know of a bug. Bugzilla is simply a bug reporting tool, and not a support
mechanism. 

If you require support, please contact support by calling 800-REDHAT1 or by
going to http://www.redhat.com/support.

Otherwise, thank you for letting us know about the problem!

Comment 5 Jason Baron 2005-08-19 16:00:00 UTC
Hmm. We've updated the e1000 driver in U2 to version 6.0.54-k2-NAPI. If you
want to test the beta kernel, it's at: http://people.redhat.com/~jbaron/rhel4/

Comment 6 John W. Linville 2005-08-22 15:00:58 UTC
"Searching on-line does reveal a history of lockups related to the 
e1000 driver." 
 
Could you provide a pointer to this information?  It may help to identify what 
you are seeing...thanks! 

Comment 7 Jiri Pallich 2012-06-20 13:28:23 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested a review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

