506254 – Palimpsest reports bad sectors on a good disk

Summary: Palimpsest reports bad sectors on a good disk

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	libatasmart
Sub Component:
Version:	11
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Lennart Poettering
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-06-16 12:06 UTC by Miro Hadzhiev
Modified:	2009-12-28 23:54 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2009-08-05 22:02:32 UTC
Type:	---
Embargoed:
Dependent Products:

Description Miro Hadzhiev 2009-06-16 12:06:35 UTC

Description of problem:
Palimpsest says my 100GB disk is failing due to bad sectors. However, the disk is actually ok.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install F11 Live CD x86_64
2. Boot
3. Log in to Gnome Desktop

  
Actual results:
Palimpsest says a disk is failing due to bad sectors

Expected results:


Additional info:

sudo fdisk -l /dev/sda

"
Disk /dev/sda: 100.0 GB, 100030242816 bytes
240 heads, 63 sectors/track, 12921 cylinders
Units = cylinders of 15120 * 512 = 7741440 bytes
Disk identifier: 0x94e494e4

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1        2298    17370112    7  HPFS/NTFS
/dev/sda2   *        2299       12922    80312904    5  Extended
/dev/sda5            2299        2314      120928+  83  Linux
/dev/sda6            2315        3962    12458848+  83  Linux
/dev/sda7            3963        4141     1353208+  82  Linux swap / Solaris
/dev/sda8            4142       11246    53713768+   b  W95 FAT32
/dev/sda9           11247       12013     5798488+  83  Linux
/dev/sda10          12014       12921     6864448+   7  HPFS/NTFS
"


[root@Xtigyro--fedora log]# devkit-disks --show-info /dev/sda
Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda
  native-path:             /sys/devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0/block/sda
  device:                  8:0
  device-file:             /dev/sda
    by-id:                 /dev/disk/by-id/ata-HTS541010G9SA00_MP2ZX0XLGK4BES
    by-id:                 /dev/disk/by-id/scsi-SATA_HTS541010G9SA00_MP2ZX0XLGK4BES
    by-path:               /dev/disk/by-path/pci-0000:00:1f.2-scsi-0:0:0:0
  detected at:             Tue 16 Jun 2009 02:35:35 PM EEST
  system internal:         1
  removable:               0
  has media:               1 (detected at Tue 16 Jun 2009 02:35:35 PM EEST)
    detects change:        0
    detection by polling:  0
    detection inhibitable: 0
    detection inhibited:   0
  is read only:            0
  is mounted:              0
  mount paths:             
  mounted by uid:          0
  presentation hide:       0
  presentation name:       
  presentation icon:       
  size:                    100030242816
  block size:              512
  job underway:            no
  usage:                   
  type:                    
  version:                 
  uuid:                    
  label:                   
  partition table:
    scheme:                mbr
    count:                 8
  drive:
    vendor:                ATA
    model:                 HTS541010G9SA00
    revision:              MBZO
    serial:                MP2ZX0XLGK4BES
    ejectable:             0
    require eject:         0
    media:                 
      compat:             
    interface:             ata
    if speed:              (unknown)
    ATA SMART:             Updated at Tue 16 Jun 2009 02:35:35 PM EEST
      assessment:          PASSED
      bad sectors:         Yes
      attributes:          One ore more attributes exceed threshold
      temperature:         43° C / 109° F
      powered on:          345 days
      offline data:        successful (645 second(s) to complete)
      self-test status:    success or never (0% remaining)
      ext./short test:     available
      conveyance test:     not available
      start test:          available
      abort test:          available
      short test:            2 minute(s) recommended polling time
      ext. test:            66 minute(s) recommended polling time
      conveyance test:       0 minute(s) recommended polling time
===============================================================================
 Attribute       Current/Worst/Threshold  Status   Value       Type     Updates
===============================================================================
 raw-read-error-rate         100/ 99/ 62   good    0           Prefail  Online 
 throughput-performance      106/100/ 40   good    0           Prefail  Offline
 spin-up-time                247/100/ 33   good    1 msec      Prefail  Online 
 start-stop-count             98/ 98/  0    n/a    3224        Old-age  Online 
 reallocated-sector-count    100/100/  5   FAIL    1900724 sectors Prefail  Online 
 seek-error-rate             100/100/ 67   good    0           Prefail  Online 
 seek-time-performance       128/100/ 40   good    0           Prefail  Offline
 power-on-hours               82/ 82/  0    n/a    345 days    Old-age  Online 
 spin-retry-count            100/100/ 60   good    0           Prefail  Online 
 power-cycle-count            99/ 99/  0    n/a    2650        Old-age  Online 
 g-sense-error-rate          100/ 92/  0    n/a    0           Old-age  Online 
 power-off-retract-count     100/100/  0    n/a    2228329     Old-age  Online 
 load-cycle-count             89/ 89/  0    n/a    118043      Old-age  Online 
 temperature-celsius-2       127/100/  0    n/a    43C / 109F  Old-age  Online 
 reallocated-event-count     100/100/  0    n/a    22          Old-age  Online 
 current-pending-sector      100/100/  0    n/a    0 sectors   Old-age  Online 
 offline-uncorrectable       100/100/  0    n/a    0 sectors   Old-age  Offline
 udma-crc-error-count        200/253/  0    n/a    0           Old-age  Online 




[root@Xtigyro--fedora log]# rpm -q DeviceKit-disks gvfs libatasmart
DeviceKit-disks-004-3.fc11.x86_64
gvfs-1.2.3-2.fc11.x86_64
libatasmart-0.12-3.fc11.x86_64




[root@Xtigyro--fedora log]# sudo skdump /dev/sda
Device: /dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
Size: 95396 MiB
Model: [HTS541010G9SA00]
Serial: [MP2ZX0XLGK4BES]
Firmware: [MBZOC60P]
SMART Available: yes
Quirks:
Awake: yes
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was completed without error.]
Total Time To Complete Off-Line Data Collection: 645 s
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]
Percent Self-Test Remaining: 0%
Conveyance Self-Test Available: no
Short/Extended Self-Test Available: yes
Start Self-Test Available: yes
Abort Self-Test Available: yes
Short Self-Test Polling Time: 2 min
Extended Self-Test Polling Time: 66 min
Conveyance Self-Test Polling Time: 0 min
Bad Sectors: 1900724 sectors
Powered On: 11.5 months
Power Cycles: 2650
Average Powered On Per Power Cycle: 3.1 h
Temperature: 43.0 C
Overall Status: BAD_SECTOR
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good
  1 raw-read-error-rate         100    99    62   0           0x000000000000 prefail online  yes
  2 throughput-performance      106   100    40   n/a         0xb71100000000 prefail offline yes
  3 spin-up-time                247   100    33   1 ms        0x010000000d00 prefail online  yes
  4 start-stop-count             98    98     0   3224        0x980c00000000 old-age online  n/a
  5 reallocated-sector-count    100   100     5   1900724 sectors 0xb4001d000000 prefail online  no 
  7 seek-error-rate             100   100    67   0           0x000000000000 prefail online  yes
  8 seek-time-performance       128   100    40   n/a         0x240000000000 prefail offline yes
  9 power-on-hours               82    82     0   11.5 months 0x562000000000 old-age online  n/a
 10 spin-retry-count            100   100    60   0           0x000000000000 prefail online  yes
 12 power-cycle-count            99    99     0   2650        0x5a0a00000000 old-age online  n/a
191 g-sense-error-rate          100    92     0   0           0x000000000000 old-age online  n/a
192 power-off-retract-count     100   100     0   2228329     0x690022000000 old-age online  n/a
193 load-cycle-count             89    89     0   118044      0x1ccd01000000 old-age online  n/a
194 temperature-celsius-2       127   100     0   43.0 C      0x2b0003003500 old-age online  n/a
196 reallocated-event-count     100   100     0   22          0x160000000000 old-age online  n/a
197 current-pending-sector      100   100     0   0 sectors   0x000000000000 old-age online  n/a
198 offline-uncorrectable       100   100     0   0 sectors   0x000000000000 old-age offline n/a
199 udma-crc-error-count        200   253     0   0           0x000000000000 old-age online  n/a

Comment 1 Oded Arbel 2009-07-13 10:05:17 UTC

It looks like your disk does have bad sectors (1900724 according to the SMART report), which could well be correct. Not to worry though - hard drives have spare sectors for this purpose, and your drive has reallocated the bad sectors to fresh spares.

According to the SMART attributes, your drive is well above the failure threshold: the current value for reallocated-sector-count is 100 with a threshold of 5, and an attribute only counts as failed when its current value is equal to or below the threshold. See man smartctl and search for "-A".
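As a minimal sketch of that rule (not libatasmart's or smartctl's actual code), the check below compares the normalized current value with the threshold. The `SmartAttribute` class and `vendor_status` function are hypothetical names, and treating a threshold of 0 as "n/a" is an assumption modeled on the attribute table in the description.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    # Hypothetical container for one row of the attribute table.
    name: str
    current: int
    worst: int
    threshold: int


def vendor_status(attr):
    """Classify an attribute using only the drive's own normalized values."""
    if attr.threshold == 0:
        # Assumption: a zero threshold means the drive never flags this attribute.
        return "n/a"
    # Failed when the current value has dropped to or below the threshold.
    return "FAIL" if attr.current <= attr.threshold else "good"


realloc = SmartAttribute("reallocated-sector-count", 100, 100, 5)
print(vendor_status(realloc))  # good
```

Under this rule the reporter's drive passes, since 100 is well above 5, regardless of what the raw value says.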

It looks like palimpsest is just trigger-happy. This bug is possibly a duplicate of bug #498115.

Comment 2 Lennart Poettering 2009-08-05 21:16:30 UTC

Hmm, 1900724 is a very high value. If it is true, your hard disk is pretty broken and I wouldn't trust it anymore.

However, I am assuming that this is probably just a parse failure, so I'll now disable the parsing of that attribute in libatasmart for your disk.
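To illustrate what such a parse failure could look like, here is a rough sketch that decodes the raw bytes from the skdump output in the description. The helper names are hypothetical. Reading all six raw bytes of attribute 5 as one little-endian integer reproduces the reported 1900724; reading only the low two bytes is just one possible alternative interpretation, an assumption that nothing in this report confirms.

```python
def raw_bytes(hex_string):
    """Turn skdump's 'Raw' column (e.g. '0xb4001d000000') into bytes."""
    digits = hex_string[2:] if hex_string.startswith("0x") else hex_string
    return bytes.fromhex(digits)


def decode_le(raw, width):
    """Read the first `width` bytes as a little-endian unsigned integer."""
    return int.from_bytes(raw[:width], "little")


# Attribute 5 (reallocated-sector-count), raw value copied from the dump.
realloc = raw_bytes("0xb4001d000000")
print(decode_le(realloc, 6))  # 1900724, the value shown as "Bad Sectors"
print(decode_le(realloc, 2))  # 180, if only the low 16 bits are the count (assumption)

# Attribute 9 (power-on-hours) decodes the same way and matches "345 days".
hours = decode_le(raw_bytes("0x562000000000"), 6)
print(hours, round(hours / 24))  # 8278 345
```

If the upper bytes of attribute 5 carry some other vendor-specific field on this model, a full-width parse would inflate the count in exactly this way.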

Comment 3 Lennart Poettering 2009-08-05 22:02:32 UTC

I have now added a quirk upstream for this:

http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=4fdaf003a3b7277c1f3aec45d52c362f6aa187bc

Comment 4 Oded Arbel 2009-08-06 12:17:18 UTC

I don't think this fix is correct - libatasmart should *completely ignore* the raw values reported when determining whether the drive is in an error state or not.

In order to decide if the drive is in failure status, libatasmart should only compare the "Current value" against the "Threshold", according to the logic described in man smartctl. If the drive considers 1900724 reallocated sectors to be a value of 100 against a threshold of 5, then libatasmart should not second-guess that.

Comment 5 Oded Arbel 2009-08-06 12:20:26 UTC

I forgot to mention that, as I described in comment #47 of bug #498115, libatasmart reports even 1 reallocated sector as a fail status - which is obviously wrong. Adding a quirk for every such case is not the right way to go.

Comment 6 Lennart Poettering 2009-08-06 15:59:37 UTC

We decided that it makes more sense to actually second-guess the drive here. Manufacturers tend to set those thresholds artificially high, to make their drives look better.

However, you are right, checking against 0 is a bit too much. This will be changed to check against a threshold that is dependent on the actual size of the disk.

Comment 7 Oded Arbel 2009-08-07 08:35:44 UTC

I think that second-guessing the manufacturer is not a good idea on the face of it - unless it is proven that, as a rule of thumb, implementors of the SMART standard cannot be relied upon to implement it properly, I would think that the default implementation in Linux should be to work with the standard.

Failing that (and I acknowledge that there may be issues with some drives), I would hope that there would be some switch that allows me to ask libatasmart to honor the standard and the device manufacturer's settings on my machine.

Comment 8 Lennart Poettering 2009-08-07 16:07:27 UTC

(In reply to comment #7)
> I think that second guessing the manufacturer is not a good idea on the face of
> it - unless proven that as a rule of thumb implementors of the SMART standard
> cannot be relied upon to implement it properly, I would think that the default
> implementation in Linux should be to work with the standard.

Nice idea. However, that doesn't work, for the simple reason that there is no "SMART standard". There is simply no official spec of the SMART attributes stuff. There was a draft spec which was pulled back. Most vendors do follow that draft but have departed from it in many ways, sometimes in a compatible way, sometimes in an incompatible way. A good part of the information libatasmart parses is not documented anywhere; it's simply something that was observed that all (or some) manufacturers seem to agree on or follow, even if it isn't set in stone.

libatasmart tries to make sense of the data available in the SMART information as well as it can. But since there is no official specification, we need to take the data that is available, distill some information from it, verify that it makes sense and then present that to the user.
 
Also note that libatasmart 0.14 now compares the number of bad sectors against a threshold that depends on the disk size.
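A size-dependent check of this kind might look roughly like the sketch below. The formula is a hypothetical log-scaled rule chosen purely for illustration; this thread does not state which formula libatasmart 0.14 uses, and the function names and the 512-byte sector size are assumptions as well.

```python
SECTOR_SIZE = 512  # bytes; matches the "block size" in the devkit-disks output


def bad_sector_threshold(disk_size_bytes):
    """Hypothetical threshold: 1024 * floor(log2(number of sectors))."""
    n_sectors = disk_size_bytes // SECTOR_SIZE
    if n_sectors < 1:
        return 0
    # bit_length() - 1 is an exact integer floor(log2(n)) for n >= 1.
    return (n_sectors.bit_length() - 1) * 1024


def too_many_bad_sectors(bad_sectors, disk_size_bytes):
    """Flag the disk only when the count exceeds the size-dependent threshold."""
    return bad_sectors > bad_sector_threshold(disk_size_bytes)


size = 100030242816  # /dev/sda from the fdisk output in the description
print(bad_sector_threshold(size))           # 27648
print(too_many_bad_sectors(1, size))        # False
print(too_many_bad_sectors(1900724, size))  # True
```

With a rule like this, a single reallocated sector no longer trips the warning, while a count as large as the one in this report still would unless a quirk disables parsing of the attribute.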

Comment 9 Oded Arbel 2009-08-08 09:44:14 UTC

Ok, thanks for the response. 

I still think there is room in palimpsest for showing where libatasmart thinks there is a failure while the manufacturer's "current value" > "threshold value" says it's OK(*), and in such cases for allowing the user to override the notification for just that property, so that it reports failures according to the manufacturer's values.

(*) even if these values are stupid, like setting the current value to always 100.