Red Hat Bugzilla – Bug 506254
Palimpsest reports bad sectors on a good disk
Last modified: 2009-12-28 18:54:53 EST
Description of problem:
Palimpsest says my 100GB disk is failing due to bad sectors. However, the disk is actually ok.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install F11 Live CD x86_64
3. Log in to Gnome Desktop
Palimpsest says a disk is failing due to bad sectors
sudo fdisk -l /dev/sda
Disk /dev/sda: 100.0 GB, 100030242816 bytes
240 heads, 63 sectors/track, 12921 cylinders
Units = cylinders of 15120 * 512 = 7741440 bytes
Disk identifier: 0x94e494e4
Device Boot Start End Blocks Id System
/dev/sda1 1 2298 17370112 7 HPFS/NTFS
/dev/sda2 * 2299 12922 80312904 5 Extended
/dev/sda5 2299 2314 120928+ 83 Linux
/dev/sda6 2315 3962 12458848+ 83 Linux
/dev/sda7 3963 4141 1353208+ 82 Linux swap / Solaris
/dev/sda8 4142 11246 53713768+ b W95 FAT32
/dev/sda9 11247 12013 5798488+ 83 Linux
/dev/sda10 12014 12921 6864448+ 7 HPFS/NTFS
[root@Xtigyro--fedora log]# devkit-disks --show-info /dev/sda
Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda
detected at: Tue 16 Jun 2009 02:35:35 PM EEST
system internal: 1
has media: 1 (detected at Tue 16 Jun 2009 02:35:35 PM EEST)
detects change: 0
detection by polling: 0
detection inhibitable: 0
detection inhibited: 0
is read only: 0
is mounted: 0
mounted by uid: 0
presentation hide: 0
block size: 512
job underway: no
require eject: 0
if speed: (unknown)
ATA SMART: Updated at Tue 16 Jun 2009 02:35:35 PM EEST
bad sectors: Yes
attributes: One ore more attributes exceed threshold
temperature: 43° C / 109° F
powered on: 345 days
offline data: successful (645 second(s) to complete)
self-test status: success or never (0% remaining)
ext./short test: available
conveyance test: not available
start test: available
abort test: available
short test: 2 minute(s) recommended polling time
ext. test: 66 minute(s) recommended polling time
conveyance test: 0 minute(s) recommended polling time
Attribute Current/Worst/Threshold Status Value Type Updates
raw-read-error-rate 100/ 99/ 62 good 0 Prefail Online
throughput-performance 106/100/ 40 good 0 Prefail Offline
spin-up-time 247/100/ 33 good 1 msec Prefail Online
start-stop-count 98/ 98/ 0 n/a 3224 Old-age Online
reallocated-sector-count 100/100/ 5 FAIL 1900724 sectors Prefail Online
seek-error-rate 100/100/ 67 good 0 Prefail Online
seek-time-performance 128/100/ 40 good 0 Prefail Offline
power-on-hours 82/ 82/ 0 n/a 345 days Old-age Online
spin-retry-count 100/100/ 60 good 0 Prefail Online
power-cycle-count 99/ 99/ 0 n/a 2650 Old-age Online
g-sense-error-rate 100/ 92/ 0 n/a 0 Old-age Online
power-off-retract-count 100/100/ 0 n/a 2228329 Old-age Online
load-cycle-count 89/ 89/ 0 n/a 118043 Old-age Online
temperature-celsius-2 127/100/ 0 n/a 43C / 109F Old-age Online
reallocated-event-count 100/100/ 0 n/a 22 Old-age Online
current-pending-sector 100/100/ 0 n/a 0 sectors Old-age Online
offline-uncorrectable 100/100/ 0 n/a 0 sectors Old-age Offline
udma-crc-error-count 200/253/ 0 n/a 0 Old-age Online
[root@Xtigyro--fedora log]# rpm -q DeviceKit-disks gvfs libatasmart
[root@Xtigyro--fedora log]# sudo skdump /dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 95396 MiB
SMART Available: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 645 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: no
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 66 min
Conveyance Self-Test Polling Time: 0 min
Bad Sectors: 1900724 sectors
Powered On: 11.5 months
Power Cycles: 2650
Average Powered On Per Power Cycle: 3.1 h
Temperature: 43.0 C
Overall Status: BAD_SECTOR
ID# Name Value Worst Thres Pretty Raw Type Updates Good
1 raw-read-error-rate 100 99 62 0 0x000000000000 prefail online yes
2 throughput-performance 106 100 40 n/a 0xb71100000000 prefail offline yes
3 spin-up-time 247 100 33 1 ms 0x010000000d00 prefail online yes
4 start-stop-count 98 98 0 3224 0x980c00000000 old-age online n/a
5 reallocated-sector-count 100 100 5 1900724 sectors 0xb4001d000000 prefail online no
7 seek-error-rate 100 100 67 0 0x000000000000 prefail online yes
8 seek-time-performance 128 100 40 n/a 0x240000000000 prefail offline yes
9 power-on-hours 82 82 0 11.5 months 0x562000000000 old-age online n/a
10 spin-retry-count 100 100 60 0 0x000000000000 prefail online yes
12 power-cycle-count 99 99 0 2650 0x5a0a00000000 old-age online n/a
191 g-sense-error-rate 100 92 0 0 0x000000000000 old-age online n/a
192 power-off-retract-count 100 100 0 2228329 0x690022000000 old-age online n/a
193 load-cycle-count 89 89 0 118044 0x1ccd01000000 old-age online n/a
194 temperature-celsius-2 127 100 0 43.0 C 0x2b0003003500 old-age online n/a
196 reallocated-event-count 100 100 0 22 0x160000000000 old-age online n/a
197 current-pending-sector 100 100 0 0 sectors 0x000000000000 old-age online n/a
198 offline-uncorrectable 100 100 0 0 sectors 0x000000000000 old-age offline n/a
199 udma-crc-error-count 200 253 0 0 0x000000000000 old-age online n/a
It looks like your disk does have bad sectors (1900724 according to the SMART report), which could well be correct. Not to worry though - hard drives have spare sectors for this purpose, and your drive has reallocated the bad sectors to fresh spares.
According to the SMART attributes, your drive is well above the failure threshold: the current value for reallocated-sector-count is 100 with a threshold of 5, and for a failure the current value would have to be equal to or below the threshold. See man smartctl and search for "-A".
It looks like palimpsest is just trigger-happy. This bug is possibly a dup of bug #498115.
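For illustration, the normalized-value check described in this comment can be sketched as follows. This is only a minimal sketch, not libatasmart or smartctl code: the function name is made up, and treating a threshold of 0 as "no threshold defined" is an assumption based on the n/a rows in the tables above.

```python
def attribute_failed(current: int, threshold: int) -> bool:
    """Return True when a normalized SMART value has reached its threshold."""
    # Assumption: a threshold of 0 means the vendor defined no failure
    # threshold for this attribute (shown as "n/a" in the tables above).
    if threshold == 0:
        return False
    return current <= threshold


# (name, current, threshold) copied from the attribute table in this report
rows = [
    ("raw-read-error-rate", 100, 62),
    ("reallocated-sector-count", 100, 5),
    ("spin-retry-count", 100, 60),
    ("start-stop-count", 98, 0),
]

failed = [name for name, cur, thr in rows if attribute_failed(cur, thr)]
print(failed)  # []
```

By this check alone, none of the listed attributes is failing, which is the point made above about reallocated-sector-count.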
Hmm, 1900724 is a very high value. If it is true, your hard disk is pretty broken and I wouldn't trust it anymore.
However, I am assuming that this is probably just a parse failure, so I'll now disable the parsing of that attribute in libatasmart for your disk.
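To see where 1900724 comes from, the raw bytes printed by skdump for attribute 5 (0xb4001d000000) can be decoded by hand. The sketch below reads the six bytes as one little-endian integer, which reproduces the reported number. It also splits them into three 16-bit words, purely as an illustration of how a different reading gives a much smaller figure. That split is an assumption, not what the quirk does, and nothing in this report says which reading (if any) this drive intends.

```python
def decode_raw(raw_hex: str) -> int:
    """Interpret a 6-byte SMART raw field as one little-endian integer."""
    return int.from_bytes(bytes.fromhex(raw_hex), "little")


def split_words(raw_hex: str) -> list:
    """Split the same 6 bytes into three little-endian 16-bit words."""
    data = bytes.fromhex(raw_hex)
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, 6, 2)]


raw = "b4001d000000"  # attribute 5 raw bytes from the skdump output, without "0x"

print(decode_raw(raw))   # 1900724
print(split_words(raw))  # [180, 29, 0]
```

The same little-endian reading also reproduces other values in the skdump table, for example 0x980c00000000 gives 3224 for start-stop-count.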
I now added a quirk upstream for this:
I don't think this fix is correct - libatasmart should *completely ignore* the real values reported when determining whether the drive is in an error state or not.
In order to decide if the drive is in failure status, libatasmart should only consult the "Current value" vs. "Threshold", according to the logic described in man smartctl. If the drive considers 1900724 reallocated sectors to be a value of 100 against a threshold of 5, then libatasmart should not second-guess that.
I forgot to mention that, as I described in comment #47 of bug #498115, libatasmart treats even 1 reallocated sector as a fail status - which is obviously wrong. Adding a quirk for every such case is not the right way to go.
We decided that it makes more sense to actually second-guess the drive here. Manufacturers tend to set those thresholds artificially high, to make their drives look better.
However, you are right, checking against 0 is a bit too much. This will be changed to check against a threshold that is dependent on the actual size of the disk.
I think that second guessing the manufacturer is not a good idea on the face of it - unless proven that as a rule of thumb implementors of the SMART standard cannot be relied upon to implement it properly, I would think that the default implementation in Linux should be to work with the standard.
Given that this is not likely to happen (and I acknowledge that there may be issues with some drives), I would hope that there would be some switch that allows me to ask libatasmart to honor the standard and the device manufacturer settings on my machine.
(In reply to comment #7)
> I think that second guessing the manufacturer is not a good idea on the face of
> it - unless proven that as a rule of thumb implementors of the SMART standard
> cannot be relied upon to implement it properly, I would think that the default
> implementation in Linux should be to work with the standard.
Nice idea. However, that doesn't work, for the simple reason that there is no "SMART standard". There is simply no official spec of the SMART attributes stuff. There was a draft spec which was pulled back. Most vendors do follow that but have departed from it in many, many ways, sometimes in a compatible way, sometimes in an incompatible way. A good part of the information libatasmart parses is not documented anywhere; it's simply something that was observed that all (or some) manufacturers seem to agree on or follow, even if it isn't set in stone.
libatasmart tries to make sense of the data available in the SMART information as well as it can. But since there is no official specification, we need to take the data that is available, distill some information from it, verify that it makes sense and then present that to the user.
Also note that libatasmart .14 now compares the number of bad sectors against a threshold that depends on the disk size.
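As a rough illustration of what a size-dependent threshold could look like, here is a minimal sketch. The formula (a scale factor times the floor of log2 of the sector count) and the function name are assumptions chosen for the example. This comment does not state the actual rule libatasmart uses.

```python
def bad_sector_threshold(disk_bytes: int, sector_size: int = 512, scale: int = 1) -> int:
    """Hypothetical rule: tolerate scale * floor(log2(number of sectors)) bad sectors."""
    sectors = disk_bytes // sector_size
    if sectors < 1:
        return 0
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
    return scale * (sectors.bit_length() - 1)


disk_bytes = 100030242816  # disk size reported by fdisk in this bug
limit = bad_sector_threshold(disk_bytes)

print(limit)            # 27
print(1900724 > limit)  # True
```

With any rule of this kind, a larger disk tolerates more reallocated sectors before a warning, but a count like 1900724 would still be flagged unless the attribute is ignored by a quirk.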
Ok, thanks for the response.
I still think there is room in palimpsest for showing where libatasmart thinks there is a failure while the manufacturer's "current value" > "threshold value" says it's OK (*), and in such cases allowing the user to override the notification for just that property so that it reports failures according to the manufacturer's values.
(*) even if these values are stupid, like setting the current value to always 100.
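A minimal sketch of the per-attribute override suggested here, assuming a hypothetical boolean setting that the user can flip for one attribute. None of these names exist in palimpsest or libatasmart. They only illustrate how the two verdicts could be combined.

```python
def effective_status(vendor_ok: bool, heuristic_ok: bool, trust_vendor: bool) -> str:
    """Combine the manufacturer's verdict with the library's own heuristic.

    vendor_ok:    normalized current value is above the manufacturer threshold
    heuristic_ok: the library's own check on the raw value found no problem
    trust_vendor: hypothetical per-attribute user override
    """
    if not vendor_ok:
        # The drive itself reports a failure; no override applies.
        return "failing"
    if heuristic_ok:
        return "ok"
    # The two verdicts disagree: show this, and let the user decide.
    if trust_vendor:
        return "ok (vendor threshold, user override)"
    return "warning (heuristic disagrees with vendor)"


# Situation in this bug: vendor says OK (100 > 5), heuristic says bad sectors.
print(effective_status(vendor_ok=True, heuristic_ok=False, trust_vendor=False))
print(effective_status(vendor_ok=True, heuristic_ok=False, trust_vendor=True))
```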