Bug 667485 - High frequency of load/unload cycles on laptop systems
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: x86_64 Linux
Priority: low, Severity: medium
Assigned To: Tom Coughlan
QA Contact: Red Hat Kernel QE team
Reported: 2011-01-05 13:33 EST by Mike Jang
Modified: 2013-10-14 05:42 EDT
Last Closed: 2013-09-09 18:23:48 EDT
Description Mike Jang 2011-01-05 13:33:57 EST
Description of problem:

Heavy cycling of hard disks on laptops -- 200,000 cycles since RHEL 6 installation in mid-Nov 2010 (current time: Jan 5, 2011), as confirmed by the Load_Cycle_Count value in the output of the smartctl -a /dev/sda command.

Version-Release number of selected component (if applicable):

5.39.1 (perhaps it should also apply to hdparm, version 9.16-3.4)

How reproducible:

Install RHEL 6 on a laptop system. (Not entirely sure if it's reproducible, as I don't have a second laptop with RHEL 6.)

Steps to Reproduce:
1. Install RHEL 6
2. Wait a bit
3. Hard disk wears out
  
Actual results:

200,000 cycles on a laptop hard drive (T410, 500GB Hitachi) in 7 weeks of use.

Expected results:

Minimal cycling, perhaps once per reboot.

Additional info:

It seems reminiscent of the following Ubuntu 8.04 bug - https://bugs.launchpad.net/ubuntu/+source/acpi-support/+bug/59695

It was noted as "critical" there. Your choice on whether to set it to a similar priority. 

FWIW, the following command seems to have stopped the problem, at least for now:

hdparm -B 254 /dev/sda
Comment 2 Michal Hlavinka 2011-01-06 08:19:24 EST
smartctl just reports these values; this bug should more likely be filed with component=kernel
Comment 3 Mike Jang 2011-01-06 10:11:20 EST
FWIW, my hard drive is now in a failure cycle, and I'm in the middle of backing up the latest bits. 

I had only gotten this laptop in Sept -- and installed RHEL 6 a couple of days after release. Before RHEL 6, I had Ubuntu 10.04 installed on that system, where the bug was previously addressed. While I didn't record the number, the cycling was relatively minimal. So I'm guessing that over the lifetime when RHEL 6 was installed on this system, it cycled maybe 500x/hour, probably more.

The goal listed by Ubuntu in their fix is about 15x/hour.
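
As a rough check on the per-hour estimate above, the arithmetic can be sketched in shell. The helper name `cycles_per_hour` and both powered-on schedules (around the clock, and about 8 hours per day) are assumptions for illustration; the thread does not record actual power-on hours.

```shell
# cycles_per_hour CYCLES HOURS: integer average of load cycles per powered-on hour
cycles_per_hour() {
    echo $(( $1 / $2 ))
}

# Assumption: powered on around the clock for 7 weeks = 7 * 7 * 24 = 1176 hours
cycles_per_hour 200000 $((7 * 7 * 24))    # prints 170

# Assumption: about 8 powered-on hours per day for 7 weeks = 7 * 7 * 8 = 392 hours
cycles_per_hour 200000 $((7 * 7 * 8))     # prints 510
```

Under the 8-hours-per-day assumption the result lands near the "maybe 500x/hour" guess; either way it is well above the roughly 15x/hour target mentioned for Ubuntu.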

In any case, while I had "only" 200,000+ cycles on this particular hard drive, I'm pretty sure it was all focused on the RHEL 6 root directory volume, as that's proving to be the most difficult to dd_rescue.

They had their version of this bug in their acpi-support package. I do not know for sure what the RHEL 6 analog to that package is (I suppose acpid).
Comment 4 RHEL Product and Program Management 2011-01-06 23:18:04 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 5 Suzanne Yeghiayan 2011-01-07 11:07:10 EST
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.
Comment 6 Eric Sandeen 2011-01-31 14:50:54 EST
Is there any indication that something in the OS is directly issuing spindown requests, or changing the spindown timer?
Comment 7 Eric Sandeen 2011-01-31 15:36:17 EST
Also have you done any other tuning, such as use of laptop mode, etc?

I don't think this is a kernel bug, but it would be good to understand what is causing it, and route it to the proper component if there does seem to be a culprit in the OS.
Comment 8 Mike Jang 2011-01-31 15:52:22 EST
As noted in the original comment, the following command

hdparm -B 254 /dev/sda 

reduced the severity of the problem to something like a couple of dozen cycles / workday.

After reviewing the Ubuntu version of the bug (to which I linked), I revised that to 

hdparm -B 200 /dev/sda

which reduces the cycling to 1 per reboot. (I've added the noted hdparm command to my /etc/rc.local.) It's certainly possible that it's reduced my battery life, but I haven't collected data on that.
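
For anyone copying the /etc/rc.local workaround, a slightly more defensive version might look like the sketch below. The function name `set_apm_level` is hypothetical; the only external command used is `hdparm -B`, whose man page documents 1-254 as APM levels and 255 as "disable APM". It checks the level and skips devices that are not present instead of failing the boot script.

```shell
# set_apm_level DEVICE LEVEL: apply an APM level with hdparm, with basic checks.
# Hypothetical helper for /etc/rc.local; must run as root to have any effect.
set_apm_level() {
    dev=$1
    level=$2
    # Reject anything that is not a plain non-negative integer
    case $level in
        ''|*[!0-9]*) echo "bad APM level: $level" >&2; return 1 ;;
    esac
    # hdparm(8): 1-254 are APM levels, 255 disables APM
    if [ "$level" -lt 1 ] || [ "$level" -gt 255 ]; then
        echo "bad APM level: $level" >&2
        return 1
    fi
    # Do not fail the boot script if the disk is absent
    if [ ! -b "$dev" ]; then
        echo "skip: $dev is not a block device" >&2
        return 0
    fi
    hdparm -B "$level" "$dev"
}
```

In /etc/rc.local this would be called as `set_apm_level /dev/sda 200`. Whether the setting survives suspend/resume depends on the drive and BIOS, so it may need to be reapplied on resume as well.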

Based on my reading of the Ubuntu bug report, the severity is somewhat dependent on the hardware, which to me suggests different power save schemes. FWIW, their bug is reported against their "acpi-support" package.

The bug was first reported on Ubuntu back in '06, and the last options in the report suggest that it still isn't fully addressed. 

I had RHEL 5 installed on a previous laptop, and did not encounter the cycling problem then. However, the laptop used a different hard drive from a different mfr. Nevertheless, it's odd that I'm seeing the problem first encountered on Ubuntu in 2006 over 4 years later with the release of RHEL 6.
Comment 9 Eric Sandeen 2011-01-31 16:04:19 EST
I understand that you can use hdparm to set APM on the drive, but I just wonder if you have found evidence that the OS is somehow overriding that value... an interesting test might be to boot rhel6 (or rhel5?) in rescue mode on this laptop, query the drive, and see what the value is.  It's possible that the BIOS is setting it to aggressive power management.  If it doesn't seem set by the bios, boot rhel6 without your boot-time hdparm call, query it... see if it has changed through some other mechanism in the OS, etc.
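
One way to run the comparison suggested above is to record the reported APM level at each stage (rescue mode, normal boot without the hdparm call, and so on) and compare afterwards. This is only a sketch: the helper names and the `APM_LOG` variable are hypothetical, and the parser assumes `hdparm -B` prints a line of the form `APM_level = 128`.

```shell
# apm_level: read `hdparm -B DEVICE` output on stdin, print the reported value
apm_level() {
    awk -F'= *' '/APM_level/ { print $2 }'
}

# log_apm_level STAGE DEVICE: append "STAGE VALUE" to a log file.
# APM_LOG is a hypothetical override; the default path is an assumption.
log_apm_level() {
    printf '%s %s\n' "$1" "$(hdparm -B "$2" | apm_level)" >> "${APM_LOG:-/tmp/apm-levels.log}"
}
```

For example, `log_apm_level rescue /dev/sda` in rescue mode and `log_apm_level normal-boot /dev/sda` after a regular boot; differing values would point at something in the OS changing the setting, while identical values would point at the BIOS or the drive default.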
Comment 10 Mike Jang 2011-01-31 17:36:25 EST
OK, I tried booting with the stuff I put in /etc/rc.local commented out.

When I then run

 hdparm -B /dev/sda

I get 

APM_level = 128

which if I understand correctly, is far too aggressive. In the three or so minutes I had it going, the Load_Cycle_Count (output of smartctl -a /dev/sda) went up by 10. 
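
The measurement described above (reading Load_Cycle_Count, waiting, reading it again) can be sketched as a pair of shell helpers. The function names are hypothetical, and the parser assumes the usual `smartctl -A` attribute table, where the raw value is the tenth column and is a plain number.

```shell
# load_cycle_count: read `smartctl -A DEVICE` output on stdin,
# print the raw Load_Cycle_Count value (assumed to be column 10)
load_cycle_count() {
    awk '$2 == "Load_Cycle_Count" { print $10 }'
}

# cycles_between DEVICE SECONDS: number of load cycles during the interval
cycles_between() {
    before=$(smartctl -A "$1" | load_cycle_count)
    sleep "$2"
    after=$(smartctl -A "$1" | load_cycle_count)
    echo $(( after - before ))
}
```

Run as root, `cycles_between /dev/sda 180` would report the increase over three minutes, comparable to the figure of 10 quoted above.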

(Strangely enough, when I tried it [the hdparm -B /dev/sda command] in rescue mode, I ended up with an error message - 

/dev/sda(1): Syntax error, invalid char 'ë') with the bits that look like an umlaut.
Comment 11 Eric Sandeen 2011-01-31 18:04:47 EST
hm.... well, the question I have is whether it is your bios setting that value, or whether it might be the default on the disk, reset on every power cycle.   I suspect that it is, and that this is not a RHEL6 bug, but rather the default for your bios or your drive.

If you can find a rescue disk with an hdparm that works with the drive, you could hopefully confirm or deny that theory...
Comment 12 Ric Wheeler 2011-01-31 22:23:49 EST
Adding a power management expert - anything in power management in RHEL6 that pokes at drives to control spin down?
Comment 13 RHEL Product and Program Management 2011-02-01 00:51:03 EST
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.
Comment 14 Matthew Garrett 2011-02-01 03:19:59 EST
Note that we're talking about head parking rather than spindown. There's no direct control for this, rather it's a consequence of the APM setting on the drive and the drive's own policy given the access pattern. I don't believe we do anything to set the APM level out of the box, but I'll check to confirm that - the other question is then what alternative operating systems do.
Comment 15 RHEL Product and Program Management 2011-02-01 14:04:07 EST
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.
Comment 16 RHEL Product and Program Management 2011-04-03 22:26:07 EDT
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 17 Mike Jang 2011-05-30 12:52:11 EDT
Dear Matthew, 

Do review the Ubuntu bug that I linked to. SUSE has released its own partial solution at https://bugzilla.novell.com/show_bug.cgi?id=386555 , but from the comments, it hasn't addressed all laptop hard drives on this issue. Apparently, this is also being monitored in detail on the following kernel.org wiki:

https://ata.wiki.kernel.org/index.php/Known_issues#Hardware_compatibility_issues
Comment 18 RHEL Product and Program Management 2011-10-07 11:20:10 EDT
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.
Comment 19 duffmckagan 2013-09-06 03:30:14 EDT
As of RHEL 6.4, has this problem been resolved or not yet?
Comment 20 Eric Sandeen 2013-09-06 15:40:21 EDT
Based on the suse bug and the kernel wiki, it seems likely that this is a hardware/bios flaw, not due to any bug in the linux kernel or other packages.  However, suse issued a script to run at boot time to correct it:

# storage-fixup			- Tejun Heo <teheo@suse.de>
#
# Script to issue fix up commands for weird disks.  This is primarily
# to adjust ATA APM setting.  Some laptop BIOSen set this value too
# aggressively causing frequent head unloads which can kill the drive
# quickly.  This script should be called during boot and resume.  It
# examines rules from /etc/storage-fixup.conf and executes matching
# commands.

I'm not sure what the RHEL precedent is for shipping hardware-behavior-workaround-scripts...
Comment 21 Ric Wheeler 2013-09-08 07:01:29 EDT
I think that we might need something here added so that we can set this kind of thing - more or less a boot sequence hook to override per disk bad defaults.
Comment 22 Tejun Heo 2013-09-08 08:53:22 EDT
Hello,

So, I've been tracking the issue for a while, contacted some manufacturers, and wrote the above script.

This is a problem caused by either the hdd or the laptop manufacturer setting APM very aggressively to lower power consumption. There is no clear guideline on how to implement APM in the spec, but most vendors seem to take the pretty simple approach of a fixed timeout, and, of course, a very short fixed timeout can be very finicky and lead to extremely frequent unloads if the parameters they assumed change even slightly: a different OS (it doesn't even have to be Linux; a different version of Windows will do), a program which ends up generating different IO patterns (e.g. anti-virus), whatever really.

It doesn't even have to involve particularly bad behaviors. It has been a while but I once calculated the rates the count is increasing against their power-on hours of a number of cases and many of them seem to be targeting power-on hours of one or two years, which probably is completely acceptable for vast majority of use cases. Also, it's not like hdds die at the moment they reach head unload max. There's some slack there.
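
The kind of calculation described above can be sketched as follows. The helper name is hypothetical, and the rated limit of 600,000 load/unload cycles is an assumption used only for illustration; the real figure should come from the drive's datasheet. The inputs correspond to the SMART attributes Load_Cycle_Count and Power_On_Hours.

```shell
# hours_to_rated_limit CYCLES POWER_ON_HOURS RATED_CYCLES
# Projects the total power-on hours at which the rated load/unload count
# would be reached if the observed rate stays constant:
#   rated / (cycles / hours) = rated * hours / cycles   (integer arithmetic)
hours_to_rated_limit() {
    cycles=$1
    hours=$2
    rated=$3
    if [ "$cycles" -le 0 ]; then
        echo "no load cycles recorded" >&2
        return 1
    fi
    echo $(( rated * hours / cycles ))
}

# Reporter's 200,000 cycles, assuming 392 power-on hours and a 600,000 rating
hours_to_rated_limit 200000 392 600000    # prints 1176
```

As noted above, reaching the rated count does not mean the drive dies at that moment, so the result is a rough indicator rather than a prediction.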

I no longer think we should be setting APM value explicitly depending on the machine / drive. It's a conscious trade-off made by the manufacturer(s) and we don't even know what each APM level means - e.g. certain APM values can make some drive run a lot hotter to the point of overheating itself. Those override values being applied are likely making some aspects of things better while worsening others - things which aren't immediately obvious. There's no way to easily tell whether a hard drive is consuming more power short of resorting to inserting an amp meter on the line supplying power to the drive.

So, my conclusion is, the device is manufactured that way by the vendor and I don't think it's a good idea to meddle with that especially given the limited amount of information we have and can have. It's the calculated risk that the vendors are taking likely based on power consumption difference, IO patterns expected on most setups (mostly idle), avg power-on hours of sold devices (pretty low), failure rate and so on. These are devices intentionally built the way they are.

Thanks.
Comment 23 Eric Sandeen 2013-09-09 12:55:52 EDT
Tejun, thanks for the very clear & persuasive answer - it sounds like we should close this NOTABUG; users who have such hardware can pursue other custom tunings using their own best judgement, not our best guesses.

Thanks,

-Eric
