Bug 485546

Summary: mount command freezes system with removable SATA drives
Product: Red Hat Enterprise Linux 5 Reporter: Todd <ToddAndMargo>
Component: kernel Assignee: David Milburn <dmilburn>
Status: CLOSED NOTABUG QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: urgent Docs Contact:
Priority: low    
Version: 5.2 CC: ajb, jfeeney
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-10-30 22:12:06 UTC Type: ---
Attachments:
Description                                   Flags
lshw of one of my servers                     none
dmesg-c drive mounted                         none
dmesg-c drive dismounted                      none
dmesg -c with drive removed                   none
dmesg -c trying to mount the removed drive    none

Description Todd 2009-02-14 03:07:59 UTC
Hi All,

I really, really need one of the Red Hat engineers to please accept this as one of their volunteer bugs.  It is very critical.

I just uncovered a critical bug when using removable SATA (not eSATA) drives that are powered off or removed from their sleeve: it causes CentOS 5.2 to freeze. No keyboard, no mouse, the Samba server stops, and all other network activity stops. Strangely, there are no blinking lights on the keyboard. I have three servers affected by this.

My motherboard (Supermicro X7DVL-E):
http://supermicro.com/products/motherboard/Xeon1333/5000V/X7DVL-E.cfm

My removable drive carriers (CRU 8510-5002-9500):
http://www.cru-dataport.com/htmldocs/products/dataport25/DP25.html

My Removable Drives (Seagate ST9320421ASG):
http://www.seagate.com/ww/v/index.jsp?vgnextoid=006442b3f64f9110VgnVCM100000f5ee0a0aRCRD&locale=en-US

Operating System: CentOS 5.2
$ uname -r
2.6.18-92.1.22.el5

Currently, on three servers, I use removable SATA (not eSATA) drives for backup. Unless they are accessed, they remain unmounted. From my fstab:

/dev/sdb1 /lin-bak ext3 defaults,noauto 0 0

Using /etc/crontab, at 14:00 and 14:30 hours /dev/sdb1 is checked to see whether the operator remembered to change the disk (they typically have five in rotation). At 23:00 hours crontab fires off the backup routine.

Problem: if the removable sleeve is empty (the operator forgot to insert a new drive or removed it at 14:00 to 14:30) or the power switch on the removable sleeve is not turned back on, the “mount /dev/sdb1” command in the disk checker and the backup script will hang for one to two minutes then the entire system will freeze. You have to use the one fingered reset to recover from it.
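A possible guard for the disk checker and the backup script, as a minimal sketch only. It assumes the kernel exposes the disk's SCSI state under /sys/block/<disk>/device/state, and the function names drive_present and guarded_mount are made up for illustration. Note the limitation: this only helps if the kernel has already noticed that the drive is gone. With a driver that does not report the removal, the state may still read "running" and the mount could hang as described above.

```shell
#!/bin/sh
# Sketch only: refuse to call mount when the kernel reports the disk as gone.
# SYSFS_ROOT is overridable so the check can be tried against a fake tree;
# /sys is the real location.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# drive_present DISK   (DISK is a name such as "sdb")
# Returns 0 only if sysfs lists the disk and its SCSI state is "running".
drive_present() {
    disk="$1"
    state_file="$SYSFS_ROOT/block/$disk/device/state"
    [ -r "$state_file" ] || return 1
    [ "$(cat "$state_file")" = "running" ]
}

# guarded_mount DISK MOUNTPOINT
# Calls mount only when drive_present succeeds; otherwise reports an error.
guarded_mount() {
    if drive_present "$1"; then
        mount "$2"
    else
        echo "drive $1 not present, skipping mount of $2" >&2
        return 1
    fi
}
```

In the scenario above this would be called as "guarded_mount sdb /lin-bak" in place of the bare mount command.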

Steps to reproduce:

0) goes without saying, but backup your stuff!

1) power off your server. Removable SATA drives are not plug and play, so you have to be powered off to register them in /dev (someday, maybe.)

2) install a removable (not eSATA) ext3 formatted drive as a second drive (not your root). Make sure it is powered on (the switch on the front of the sleeve is on and locked).

3) power back on. Make sure the new drive shows up in /dev. Make the appropriate entry in fstab (use mine above). Create a “/lin-bak” directory.

4) make sure you can mount the drive with “mount /lin-bak”. Then unmount the drive (“umount /lin-bak”). Make sure the drive is dismounted by checking mtab (“cat /etc/mtab”)

5) power off the removable SATA drive and/or remove the removable SATA drive from the sleeve

6) attempt to remount /lin-bak with “mount /lin-bak”

“mount” will hang for about one to two minutes, then your system will hard freeze.
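The mtab check in step 4 can be scripted. This is a minimal sketch, assuming the usual mtab layout where the second field of each line is the mount point; the helper name is_mounted is made up for illustration.

```shell
#!/bin/sh
# is_mounted MOUNTPOINT [TABLE]
# Returns 0 if MOUNTPOINT appears as the second field of any line in TABLE
# (default /etc/mtab), and non-zero otherwise.
is_mounted() {
    mnt="$1"
    table="${2:-/etc/mtab}"
    awk -v m="$mnt" '$2 == m { found = 1 } END { exit !found }' "$table"
}
```

For example, after "umount /lin-bak", running "is_mounted /lin-bak" should return non-zero.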


Many thanks,
-T

Comment 1 Todd 2009-02-17 02:39:09 UTC
Created attachment 332164
lshw of one of my servers

While searching for a workaround to test whether my removable drive was actually in the sleeve before calling the "mount" command, a respondent said that the output of my "lshw" command would be helpful.  Please note that two of the three servers only have one partition on /dev/sdb.  This output is from the third server, which has two.  In my write-up, I refer only to /dev/sdb1, as that is how the majority of my systems are set up.

-T

Comment 2 David Milburn 2009-02-20 21:24:55 UTC
Would you please boot with "log_buf_len=1000000" kernel parameter and then
after booting

# echo 509 > /proc/sys/dev/scsi/logging_level

And then please capture and attach "dmesg -c" after step #4 (unmount), and also after step #5 and step #6?
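One way to collect the requested logs, as a rough sketch. The helper name capture_step, the DMESG_CMD override, and the output directory are made up for illustration; only "dmesg -c" and the logging_level write come from the request above. The log_buf_len=1000000 parameter still has to be added to the kernel line at boot by hand.

```shell
#!/bin/sh
# capture_step LABEL OUTDIR
# Saves the current kernel log to OUTDIR/dmesg-LABEL.txt.
# "dmesg -c" prints the ring buffer and then clears it, so each file
# holds only the messages logged since the previous capture.
# DMESG_CMD can be overridden for a dry run.
capture_step() {
    label="$1"
    outdir="$2"
    mkdir -p "$outdir" || return 1
    ${DMESG_CMD:-dmesg -c} > "$outdir/dmesg-$label.txt"
}

# Intended use, as root (the directory is only an example):
#   echo 509 > /proc/sys/dev/scsi/logging_level
#   capture_step step4-unmounted /root/bug485546
#   capture_step step5-removed   /root/bug485546
#   capture_step step6-mount     /root/bug485546
```

The capture for step 6 would have to be run from a second terminal while mount is hanging.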

Comment 3 Todd 2009-02-22 01:50:18 UTC
Created attachment 332836
dmesg-c drive mounted

This is dmesg -c with the drive mounted (step 4)

Comment 4 Todd 2009-02-22 01:51:44 UTC
Created attachment 332837
dmesg-c drive dismounted

This is dmesg -c after the drive gets dismounted from step 4

Comment 5 Todd 2009-02-22 01:52:42 UTC
Created attachment 332838
dmesg -c with drive removed

This is dmesg -c with the drive removed.  Step 5

Comment 6 Todd 2009-02-22 01:54:41 UTC
Created attachment 332839
dmesg -c trying to mount the removed drive

This is dmesg -c trying to mount the drive when the drive is removed (you have two minutes to run the command before the system freezes).  Step 6

Comment 7 Todd 2009-02-22 03:51:37 UTC
Hi All,

   I do not know if this is relevant to this, but Robert Hancock over at kernel.org just wrote me this:


"If the device is in AHCI mode then the ata_piix driver won't load for it - at least it won't in current kernels, I can't say for sure that it won't in the CentOS 5 version.. You do need to get it using the ahci driver instead of ata_piix or hotplug definitely won't work.

You can try changing the boot initrd to try to load the AHCI driver instead by changing the scsi_hostadapter entry in /etc/modprobe.conf to be ahci instead of ata_piix, then rebuilding the initrd or reinstalling the kernel RPM. However, if the BIOS isn't set up properly for AHCI mode to work, you'll have to either boot up in rescue mode and fix it, or boot up from a different kernel entry in grub."
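What that advice might look like in practice, as a sketch only and not a tested procedure. The function name use_ahci is made up for illustration, and the mkinitrd line assumes the default CentOS 5 initrd naming under /boot. As the quote warns, this should only be tried with a fallback kernel entry or rescue media at hand.

```shell
#!/bin/sh
# use_ahci CONF
# Rewrites every "alias scsi_hostadapterN ata_piix" line in CONF to use
# ahci instead, keeping the original file as CONF.bak.
use_ahci() {
    conf="$1"
    cp -p "$conf" "$conf.bak" || return 1
    sed 's/^\(alias[[:space:]]\{1,\}scsi_hostadapter[0-9]*[[:space:]]\{1,\}\)ata_piix[[:space:]]*$/\1ahci/' \
        "$conf.bak" > "$conf"
}

# Intended use, as root:
#   use_ahci /etc/modprobe.conf
#   # Rebuild the initrd for the running kernel (assumes default naming):
#   mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
```

Other aliases in the file, such as sata_sil or megaraid_mbox, are left untouched by the sed expression.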

I am afraid rebuilding initrd is over my head.  But, maybe something in what he wrote someone else will understand.

-T

Comment 8 Todd 2009-02-22 03:54:16 UTC
My /etc/modprobe.conf:

alias scsi_hostadapter sata_sil
alias eth1 e1000e
alias scsi_hostadapter1 megaraid_mbox
alias scsi_hostadapter2 ata_piix
alias snd-card-0 snd-intel8x0
options snd-card-0 index=0
options snd-intel8x0 index=0
remove snd-intel8x0 { /usr/sbin/alsactl store 0 >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0

Comment 9 Todd 2009-02-22 22:17:24 UTC
Hi All,

I have been doing a bunch of research and have been corresponding with the folks over at kernel.org.

Here is the scoop.

1) my motherboard's BIOS is corked.  Supermicro has since added all kinds of AHCI support to the BIOS.  (I will be ordering new BIOS chips on Monday.)

2) You only "automatically" load the AHCI drivers at installation time.  No auto detect after the fact.  (But, you can do it manually.)

3) From my /etc/modprobe.conf

        alias scsi_hostadapter sata_sil
        alias scsi_hostadapter1 megaraid_mbox
        alias scsi_hostadapter2 ata_piix

The "ata_piix" driver is the WRONG driver for "AHCI Hot Pluggable" devices.  The driver I need, but did not get at install time due to my crappy BIOS, should be

       alias scsi_hostadapter2 ahci

Which won't work properly until I get my new BIOS chips.  When working correctly, I should be able to comment out the "ata_piix" driver completely.  (And, the AHCI driver should work fine with my SATA DVD/CD writer.)
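A quick way to see which of the two drivers a server is configured for, as a minimal sketch. The function name check_sata_driver and its messages are made up for illustration; it only reads the scsi_hostadapter aliases and does not prove which module is actually bound to the controller ("lsmod" shows what is loaded).

```shell
#!/bin/sh
# check_sata_driver [CONF]
# Reports which SATA driver the scsi_hostadapter aliases in CONF
# (default /etc/modprobe.conf) name.
# Returns 0 for ahci, 1 for ata_piix without ahci, 2 for neither.
check_sata_driver() {
    conf="${1:-/etc/modprobe.conf}"
    pat='^alias[[:space:]]\{1,\}scsi_hostadapter[0-9]*[[:space:]]\{1,\}'
    if grep -q "${pat}ahci" "$conf"; then
        echo "ahci is configured"
    elif grep -q "${pat}ata_piix" "$conf"; then
        echo "ata_piix is configured without ahci: hot plug is not expected to work"
        return 1
    else
        echo "neither ahci nor ata_piix is configured"
        return 2
    fi
}
```

Against the modprobe.conf shown above, this would take the ata_piix branch and return 1.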


So "New" description of symptom:

"mount" will hang for approximately two minutes and then the entire system will freeze when attempting to execute a "mount" on a physically detached Hot Pluggable device when accidentally using the wrong (ata_piix) driver.

New steps to reproduce:

0) goes without saying, but backup your stuff!

1) verify that your /etc/modprobe.conf is misconfigured.  To reproduce this bug, you should only be using the "ata_piix" driver (alias scsi_hostadapter2 ata_piix) and not the (correct) "ahci" driver (alias scsi_hostadapter2 ahci)

2) power off your server. Removable SATA drives are only detected on hot plug with the AHCI driver, so with ata_piix you have to be powered off to register them in /dev.

3) install a removable (not eSATA) ext3 formatted drive as a second drive (not your root). Make sure it is powered on (the switch on the front of the sleeve is on and locked).

4) power back on. Make sure the new drive shows up in /dev. Make the appropriate entry in fstab.  For instance:
        /dev/sdb1 /lin-bak ext3 defaults,noauto 0 0
Create a /lin-bak directory.

5) make sure you can mount the drive with mount /lin-bak. Then unmount the drive (umount /lin-bak). Make sure the drive is dismounted by checking mtab (cat /etc/mtab)

6) power off the removable SATA drive and/or remove the removable SATA drive from the sleeve

7) attempt to remount /lin-bak with "mount /lin-bak"

"mount" will hang for about one to two minutes, then your system will hard freeze.  Note: if you are quick enough, and repower the device, you can save yourself the freeze up.


It will take me two to three weeks to upgrade the BIOS on my three servers.  I can delay the update on one of the servers (mine) if requested by this forum.  Let me know.

Many thanks,
-T

Comment 10 Todd 2009-03-05 00:49:09 UTC
Hi All,

A comment and an update.

A comment.  My motherboard's controller is a "hot plugging" controller.  Whether or not the BIOS is configured for AHCI, comparing removing a hard drive from my controller to "rip[ing] out a memory chip or your VGA", as has been suggested in other quarters, is a great misunderstanding of the technology.

As far as I can tell, AHCI is not the default BIOS setting on any motherboard I have checked.  I had no idea I was misconfigured until I took down an accounting firm's server right in the middle of tax season.  I had no idea there was any problem hot plugging a device into or out of my motherboard's hot plugging controller without the correct driver being loaded into Linux.  From my posting around the web on this issue, very, very few individuals know either.

So, the object of this bug is to give the user an error message when he tries to mount a missing drive when he is using the wrong driver to operate what he thinks is a properly configured system, instead of crashing the server.  My uninformed opinion would be to look at the two minute time out and find why it crashes instead of safely reentering.

An update.  I got my first server's BIOS chip changed.  Updating initrd was way simpler than I had feared.  (My boot device is a RAID controller, which does not require ahci or ata_piix to operate: I can still boot if I goof ahci or ata_piix.  Fortunately, it never came to that.)  My removable SATA drive works perfectly under my new BIOS's AHCI function, just like its USB and Firewire Hot Plugging cousins (and not at all like ripping my memory or VGA out of my system).


I will be updating my second server tomorrow.  I will wait on updating my third server (my office's server) until I hear back from you guys.

-T

Comment 13 John Feeney 2013-10-30 22:12:06 UTC
This Bugzilla has been reviewed by Red Hat and is not planned on being
addressed in Red Hat Enterprise Linux 5, and therefore is being closed.
If this bug is critical to production systems, please contact your Red
Hat support representative and provide a sufficient business justification
in order to re-open it.