Description of problem:

I recently built a new mail server (Postfix and Cyrus-Imap) that authenticates users against my Windows 2003 AD. Before installation, I ran the full suite of memory tests from the install CD with no errors. During testing, the system had 2-3 NMI errors and locked up. These seemed to happen most often when accessing the CD/DVD drive. Adding "acpi=off" seemed to fix that problem.

I migrated our old, failing mail system to this machine and put it into production. This morning, on its 4th day of operation, I found it locked up and had to do a hard reset. No video, no keyboard LEDs, nothing. /var/log/messages had no indication of the problem, just an IMAP login (or logoff), then the kernel boot messages from when I rebooted.

This afternoon, while cleaning up the backup routine, I was rsyncing files from one directory to another and the system locked up again. This time, the Caps Lock and Scroll Lock LEDs were blinking. In addition to "acpi=off", I have added "nmi_watchdog=1" to my kernel boot options, even though I no longer get the NMI error. I think that may be what allowed the keyboard LEDs to blink this time, however.

This is a dual PIII 1 GHz with 2 GB RAM (Dell/Crucial), a Dell PERC, one unused e100 interface and an add-in e1000 card. It's a basic server install with Postfix, Cyrus-imapd and SquirrelMail. I have added Trend Micro's InterScan VirusWall for Unix and applied all available updates. Also installed are awstats and keychain.

I've applied all RHN updates EXCEPT for krb5-libs and krb5-workstation. kernel-smp-2.6.9-5.0.3.EL was applied but not booted until after the first lockup today. It's now the default and running kernel. SELinux was disabled after the first lockup. It had been running in permissive mode, but winbind generated a lot of warnings. Most non-essential services have been disabled (PCMCIA, ISDN, etc.).

The same system ran with no problems under Windows 2000 and Red Hat 9.
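One way to confirm that the boot options mentioned above are really in effect on the running kernel is to look at /proc/cmdline. The following is only a minimal sketch and not part of the original report; the helper name has_boot_option and its optional file argument are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given option appears as a whole word
# on the kernel command line. The second argument (defaulting to
# /proc/cmdline) exists only so the check can be tried against a sample file.
has_boot_option() {
    local opt=$1 file=${2:-/proc/cmdline}
    local words word
    [ -r "$file" ] || return 1
    # /proc/cmdline is a single line of space-separated options.
    read -r -a words < "$file"
    for word in "${words[@]}"; do
        [ "$word" = "$opt" ] && return 0
    done
    return 1
}

for opt in acpi=off nmi_watchdog=1; do
    if has_boot_option "$opt"; then
        echo "$opt: active"
    else
        echo "$opt: not on the kernel command line"
    fi
done
```

The comparison is done per word, so a partial string such as "acpi=of" does not count as a match.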
Created attachment 111873: lspci and /var/log/messages
A bit more experimenting (actually, just trying to back up the machine) and I feel I can reproduce the problem fairly reliably now. I have a simple shell script that stops Cyrus-Imapd and Postfix, rsyncs the mail data and config directories to a backup directory, restarts the services, then creates an ISO image from the backup directory to be burned onto DVD. Running this script locks up the system during the rsync phase. Sometimes it locks up immediately, before any rsyncing is done; other times it gets a small way into it.

As the machine locked up when I ran it this morning, I decided to try in runlevel 1 with both the e1000 and e100 drivers unloaded. Running the script in this environment worked, although I stopped it during the mkisofs phase to get e-mail back on line for the users. Searching on-line does reveal a history of lockups related to the e1000 driver.

mail_backup.sh:

#!/bin/bash
# postfix rc.d script also starts/stops TrendMicro Interscan Viruswall
#
# Added verbose and progress options to rsync commands to trace when lockups occur.
#
BACKUPDIR=/backup/mail
partition=`grep "^partition-default:" /etc/imapd.conf | cut -f2 -d" "`
config=`grep "^configdirectory:" /etc/imapd.conf | cut -f2 -d" "`
service postfix stop
service cyrus-imapd stop
su - cyrus -c "/usr/lib/cyrus-imapd/ctl_mboxlist -d" > /backup/mail/mailboxlist.txt
rsync -avR --progress --delete $partition $BACKUPDIR
rsync -avR --progress --delete $config $BACKUPDIR
rsync -avR --progress --delete /etc/imapd.conf $BACKUPDIR
rsync -avR --progress --delete /etc/cyrus.conf $BACKUPDIR
rsync -avR --progress --delete /etc/postfix $BACKUPDIR
rsync -avR --progress --delete /var/spool/postfix $BACKUPDIR
service cyrus-imapd start
service postfix start
mkisofs -R -J -o /backup/mail/mail-backup.iso /backup/mail/*
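Because a hard lockup leaves nothing in /var/log/messages, one possible way to narrow down which step was running is to write a marker to a file and flush it to disk before each command starts. This is only a sketch under that assumption; the run_step helper and the STEPLOG path are hypothetical names and are not part of the original mail_backup.sh.

```shell
#!/bin/bash
# Hypothetical log location; any path on a local filesystem would do.
STEPLOG=${STEPLOG:-/tmp/mail_backup.steps}

# Record a step in STEPLOG, flush it to disk, run the command, then
# record its exit status. If the machine locks up mid-step, the log
# should end with a START line that has no matching END line.
run_step() {
    local label=$1
    shift
    echo "$(date '+%F %T') START $label" >> "$STEPLOG"
    sync
    "$@"
    local status=$?
    echo "$(date '+%F %T') END   $label (exit $status)" >> "$STEPLOG"
    sync
    return $status
}

# Example with harmless commands; in the backup script each rsync line
# could be wrapped the same way, e.g.
#   run_step "rsync postfix config" rsync -avR --delete /etc/postfix "$BACKUPDIR"
run_step "list root directory" ls / > /dev/null
run_step "always fails" false || echo "step failed, see $STEPLOG"
```

After a lockup and reboot, the last line of the step log would then point at the command that was in flight, which the rsync --progress output on the console cannot do once the screen is gone.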
I have switched from the e1000 NIC to the e100, rebooted and unloaded the e1000 module (for some reason it loaded after reboot anyway), then attempted a backup. This time the backup worked without a problem. Based on this, I'd say there is a problem with the e1000 module that causes certain configurations to lock up. Please let me know if more information is needed or additional testing is required.
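Since the e1000 module came back after the reboot, it may be worth confirming which NIC modules are actually loaded before each test run. Below is a minimal sketch that reads /proc/modules; the module_loaded helper and its optional file argument are hypothetical and only there for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the named module is listed in
# /proc/modules (the first field of each line is the module name). The
# optional second argument lets the check run against a saved copy.
module_loaded() {
    local name=$1 file=${2:-/proc/modules}
    [ -r "$file" ] || return 1
    # Exact match on the first field, so "e100" does not match "e1000".
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$file"
}

for mod in e1000 e100; do
    if module_loaded "$mod"; then
        echo "$mod is loaded"
    else
        echo "$mod is not loaded"
    fi
done
```

If the module is present, "modprobe -r e1000" (as root) unloads it, provided no interface is still using it.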
I'm not sure if this is a request for support, or simply an attempt to let us know of a bug. Bugzilla is simply a bug reporting tool, and not a support mechanism. If you require support, please contact support by calling 800-REDHAT1 or by going to http://www.redhat.com/support. Otherwise, thank you for letting us know about the problem!
Hmm. We've updated the e1000 driver in U2 to version 6.0.54-k2-NAPI. If you want to test the beta kernel, it's at: http://people.redhat.com/~jbaron/rhel4/
"Searching on-line does reveal a history of lockups related to the e1000 driver." Could you provide a pointer to this information? It may help to identify what you are seeing...thanks!
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release you asked us to review is now End of Life. Please see https://access.redhat.com/support/policy/updates/errata/ If you would like Red Hat to reconsider your request for an active release, please re-open the request via the appropriate support channels and provide additional supporting details about the importance of this issue.