Bug 498115

Summary:

gnome-disk-utility notifies of a 'failing' disk when values are still above manufacturer threshold (bad sectors, temperature)

Product:

[Fedora] Fedora

Reporter:

Antonio A. Olivares <olivares14031>

Component:

libatasmart

Assignee:

Lennart Poettering <lpoetter>

Status:

CLOSED WONTFIX

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Docs Contact:

Priority:

low

Version:

CC:

2011xuesong, antonio.montagnani, awilliam, bugzilla, chalsall, chancebrohm, davidz, d.gavrilovic, gczarcinski, infertux, jeroen, jfrieben, jhaar, jim.cromie, kevinverma, lpoetter, mfuruta, mgoodwin, mishu, oded, robatino, sawrub, stuart, synchron, thomas, torsten, youarecrazydude, yulrottmann

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Clones:

707841

Environment:

Last Closed:

2013-02-14 00:44:54 UTC

Bug Depends On:

Bug Blocks:

707841

Attachments:

Description	Flags
devkit dump for jhaar	none
skdump for jhaar	none
Output of: # devkit-disks --show-info /dev/sda	none
Output of: # skdump /dev/sda	none
Output of # smartctl -a /dev/sda	none
Screenshot of palimpsest	none
Patch for libatasmart-0.12 to remove special handling for reallocated-sector-count attribute	none
Patch for gnome-disk-utility to remove special handling for reallocated-sector-count attribute	none
output-1-devkit-disks.txt	none
output-2-skdump-sda.txt	none
smartctl report for my "failing" drive	none
skdump of my "failing" disk	none
devkit-disks dump of my "failing" disk	none
devkit-disks dump	none
Output of smartctl -a /dev/sda	none
skdump for my failing drive test Model: [Hitachi HTS541612J9SA00]	none
devkit-disks report for /dev/sdb	none
updated revised output includes skdump and libatasmart version.	none
Message show by GDU	none
devkit-disks	none
skdump	none
Adapted from Ubuntu libatasmart package	none
Screenshot of palimpsest with incorrect bad sector count	none

Description Antonio A. Olivares 2009-04-28 22:33:18 UTC

Description of problem:
Every time I log in to GNOME, I get the message: One or more disks are failing :(

Version-Release number of selected component (if applicable):
gnome-disk-utility-0.3-0.5.20090415git.fc11.x86_64

[olivares@gray ~]$ yum whatprovides /usr/libexec/gdu-notification-daemon
Loaded plugins: refresh-packagekit
Importing additional filelist information
gnome-disk-utility-0.3-0.5.20090415git.fc11.x86_64 : Disk management application
Repo : rawhide
Matched from:
Filename : /usr/libexec/gdu-notification-daemon

gnome-disk-utility-0.3-0.5.20090415git.fc11.x86_64 : Disk management application
Repo : installed
Matched from:
Other : Provides-match: /usr/libexec/gdu-notification-daemon

How reproducible:
Log in to GNOME; if you use KDE you need not worry :)

Steps to Reproduce:
1. Log in to GNOME
2. If you are lucky/unlucky you will see the disk utility telling you that one or more disks are failing :(
3. Repeat steps 1 and 2 if needed

Actual results:
The utility tells me that my disk is failing when that is not true. My machine is brand new, so how can the disk be failing? I don't understand. I found two workarounds:

1) Use KDE
2) Disable the notification daemon from starting up, using the session startup settings. The entry to disable is:

Disk Notifications
/usr/libexec/gdu-notification-daemon --delay
Provides notifications related to disks

Expected results:
For it not to complain.

Additional info:
Upon request. ***NOTE*** I might not be able to respond quickly to mails asking for more info. My rawhide machines are at school, and the school might close until further notice due to the SWINE FLU. A double WHAMMY, since we are also administering STATEWIDE TAKS EXAMS. Thank you for your consideration.

[root@gray ~]# fdisk -l

Disk /dev/sda: 160.0 GB, 160040803840 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x296c296c

Device Boot Start End Blocks Id System
/dev/sda1 * 1 26 204800 83 Linux
Partition 1 does not end on cylinder boundary.
/dev/sda2 26 19457 156083521 8e Linux LVM

Disk /dev/dm-0: 157.9 GB, 157915545600 bytes
255 heads, 63 sectors/track, 19198 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-0 doesn't contain a valid partition table

Disk /dev/dm-1: 1912 MB, 1912602624 bytes
255 heads, 63 sectors/track, 232 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

Disk /dev/dm-1 doesn't contain a valid partition table
[root@gray ~]#

Comment 1 David Zeuthen 2009-04-28 22:50:29 UTC

Have you considered that maybe your disk is really failing? Anyway, there's the odd chance this is a false positive especially since we just added these bits and pieces to the OS. Please include the output of

 - devkit-disks --show-info /dev/sda
 - skdump /dev/sda (as root)
 - rpm -q libatasmart

Thanks.

Comment 2 Antonio A. Olivares 2009-04-29 00:38:35 UTC

> Have you considered that maybe your disk is really failing?

No, I haven't. If it were failing I would hear sounds, the sounds that tell me a hard drive is failing. On two or three machines on which I run rawhide, it (the disk utility) is telling me that the disks are failing. It can't be true: I have updated the machines without problems, and I have not gotten any read/write errors. How can it be?

Anyway, I don't know how soon I can send you the input that you have asked for.  At the place I work at, the school might be closed due to the swine flu virus, so I don't know if I'll be reporting for work, if I do, then I'll try to do what you have requested.  

They closed down a Middle School since yesterday
http://www.valleycentral.com/news/news_story.aspx?id=292465

and today they told the students and staff that the school was being closed down until further notice.

http://www.valleycentral.com/news/news_story.aspx?id=292942

Thank you for your consideration.  I'll send the information as soon as I can.

Comment 3 Antonio A. Olivares 2009-04-29 11:47:41 UTC

I guess I might not be able to respond back till Monday :(, hope that no one gets upset because of this.

http://www.krgv.com/news/local/story/Rio-Grande-City-I-S-D-Closes-All-Schools-Until/ZAj96n-QEU2a3qbdslccPg.cspx

Comment 4 Jason Haar 2009-05-04 02:25:40 UTC

This is a "me too".

I've just installed FC11-Preview and I'm seeing this too. I had FC10 on my Dell D430 laptop, reinstalled/trashed it for FC11, and upon first logon it's telling me my disk is going South. Not impossible - but not probable either. FYI I'm using the full disk encryption option (ie /boot unencrypted and the rest encrypted, with "/" and swap on top of that) if that makes a difference. Strangely enough I don't see smartd running? Didn't that used to be activated by default? 

I'll attach the data you asked the original person for.

rpm -q libatasmart
libatasmart-0.12-2.fc11.i586

Comment 5 Jason Haar 2009-05-04 02:26:45 UTC

Created attachment 342266
devkit dump for jhaar

Comment 6 Jason Haar 2009-05-04 02:27:13 UTC

Created attachment 342267
skdump for jhaar

Comment 7 Steve 2009-05-07 07:24:12 UTC

Created attachment 342781
Output of: # devkit-disks --show-info /dev/sda

Same problem here on Fedora-11-Preview.

libatasmart-0.12-2.fc11.i586

Comment 8 Steve 2009-05-07 07:24:50 UTC

Created attachment 342782
Output of: # skdump /dev/sda

Comment 9 Jason Haar 2009-05-07 07:42:07 UTC

Hi there

I just noticed; my skdump shows "Powered On: 6.7 months" on my laptop (uptime:2h) and Steve above states his PC has been powered on for 2.7 years. As FC11 hasn't been out that long, I don't think it's correct!

That along with the 242458743 bad/corrupt sectors must be plain wrong. I don't think I'd be able to type this in if it were true ;-)

Jason
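
The "Powered On" figure is derived from the raw bytes of the power-on-hours attribute. The following is only a minimal sketch of that decoding, under two assumptions: that the raw field printed by skdump (for example 0x1e0000000000 in the dump posted in comment 10) lists the six raw bytes in little-endian order, and that the drive counts whole hours. The function names are hypothetical and are not part of libatasmart.

```python
def decode_raw(raw_hex: str) -> int:
    """Decode a 6-byte raw SMART field printed in byte order.

    Assumption: the bytes are little-endian (least significant first).
    """
    digits = raw_hex[2:] if raw_hex.lower().startswith("0x") else raw_hex
    return int.from_bytes(bytes.fromhex(digits), "little")


def power_on_days(raw_hex: str) -> float:
    # Assumption: this drive counts power-on time in hours.
    return decode_raw(raw_hex) / 24


# Raw values copied from the skdump output in comment 10.
print(decode_raw("0x1e0000000000"))     # power-on-hours: 30 hours
print(power_on_days("0x1e0000000000"))  # 1.25 days, as devkit-disks reports
print(decode_raw("0x170a00000000"))     # spin-up-time: 2583 ms, shown as 2.58 secs
print(decode_raw("0x360000000000"))     # power-cycle-count: 54
```

Under these assumptions the decoded numbers agree with the pretty-printed values in that dump. The counter belongs to the drive itself, so it covers all the time the drive has been powered on, not only the time under the current installation.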

Comment 10 Antonio A. Olivares 2009-05-07 12:13:25 UTC

OK,

Information requested:

[olivares@gray Documents]$ devkit-disks --show-info /dev/sda
Showing information for /org/freedesktop/DeviceKit/Disks/devices/sda
  native-path:             /sys/devices/pci0000:00/0000:00:08.0/host1/target1:0:0/1:0:0:0/block/sda                                                             
  device:                  8:0                                                  
  device-file:             /dev/sda                                             
    by-id:                 /dev/disk/by-id/ata-WDC_WD1600JS-00NCB1_WD-WMANM8000418                                                                              
    by-id:                 /dev/disk/by-id/scsi-SATA_WDC_WD1600JS-00_WD-WMANM8000418                                                                            
    by-path:               /dev/disk/by-path/pci-0000:00:08.0-scsi-1:0:0:0      
  detected at:             Thu 07 May 2009 07:11:07 AM CDT                      
  system internal:         1                                                    
  removable:               0                                                    
  has media:               1                                                    
    detects change:        0                                                    
    detection by polling:  0                                                    
    detection inhibitable: 0                                                    
    detection inhibited:   0                                                    
  is read only:            0                                                    
  is mounted:              0                                                    
  mount paths:                                                                  
  mounted by uid:          0                                                    
  presentation hide:       0                                                    
  presentation name:                                                            
  presentation icon:                                                            
  size:                    160040803840                                         
  block size:              512                                                  
  job underway:            no                                                   
  usage:                                                                        
  type:                                                                         
  version:                                                                      
  uuid:                                                                         
  label:                                                                        
  partition table:                                                              
    scheme:                mbr                                                  
    count:                 2                                                    
  drive:                                                                        
    vendor:                ATA                                                  
    model:                 WDC WD1600JS-00N                                     
    revision:              10.0                                                 
    serial:                WD-WMANM8000418                                      
    ejectable:             0                                                    
    require eject:         0                                                    
    media:                                                                      
      compat:                                                                   
    interface:             ata                                                  
    if speed:              (unknown)                                            
    ATA SMART:             Updated at Thu 07 May 2009 07:11:07 AM CDT           
      assessment:          PASSED                                               
      bad sectors:         None                                                 
      attributes:          One ore more attributes exceed threshold             
      temperature:         54° C / 129° F                                       
      powered on:          1.25 days                                            
      offline data:        suspended (4980 second(s) to complete)               
      self-test status:    success or never (0% remaining)                      
      ext./short test:     available                                            
      conveyance test:     available                                            
      start test:          available                                            
      abort test:          available                                            
      short test:            2 minute(s) recommended polling time               
      ext. test:            60 minute(s) recommended polling time               
      conveyance test:       6 minute(s) recommended polling time               
=============================================================================== 
 Attribute       Current/Worst/Threshold  Status   Value       Type     Updates 
=============================================================================== 
 raw-read-error-rate         200/200/ 51   good    0           Prefail  Online  
 spin-up-time                208/186/ 21   good    2.58 secs   Prefail  Online  
 start-stop-count            100/100/  0    n/a    54          Old-age  Online  
 reallocated-sector-count    200/200/140   good    0 sectors   Prefail  Online  
 seek-error-rate             200/200/ 51   good    0           Prefail  Online  
 power-on-hours              100/100/  0    n/a    1.25 days   Old-age  Online  
 spin-retry-count            100/253/ 51   good    0           Prefail  Online  
 calibration-retry-count     100/253/ 51   good    0           Old-age  Online  
 power-cycle-count           100/100/  0    n/a    54          Old-age  Online  
 airflow-temperature-celsius  46/ 37/ 45   FAIL    54C / 129F  Old-age  Online  
 temperature-celsius-2        93/ 84/  0    n/a    54C / 129F  Old-age  Online  
 reallocated-event-count     200/200/  0    n/a    0           Old-age  Online  
 current-pending-sector      200/200/  0    n/a    0 sectors   Old-age  Online  
 offline-uncorrectable       100/253/  0    n/a    0 sectors   Old-age  Offline 
 udma-crc-error-count        200/200/  0    n/a    0           Old-age  Online  
 multi-zone-error-rate       100/253/ 51   good    0           Prefail  Offline 
[olivares@gray Documents]$ su -
Password:                      
[root@gray ~]# skdump /dev/sda (as root)
-bash: syntax error near unexpected token `('
[root@gray ~]# skdump /dev/sda
Device: /dev/sda              
Type: 16 Byte SCSI ATA SAT Passthru
Size: 152626 MiB                   
Model: [WDC WD1600JS-00NCB1]       
Serial: [WD-WMANM8000418]          
Firmware: [10.02E02]               
SMART Available: yes               
Quirks:                            
Awake: yes                         
SMART Disk Health Good: yes        
Off-line Data Collection Status: [Off-line data collection activity was suspended by an interrupting command from host.]                                        
Total Time To Complete Off-Line Data Collection: 4980 s                         
Self-Test Execution Status: [The previous self-test routine completed without error or no self-test has ever been run.]                                         
Percent Self-Test Remaining: 0%                                                 
Conveyance Self-Test Available: yes                                             
Short/Extended Self-Test Available: yes                                         
Start Self-Test Available: yes                                                  
Abort Self-Test Available: yes                                                  
Short Self-Test Polling Time: 2 min                                             
Extended Self-Test Polling Time: 60 min                                         
Conveyance Self-Test Polling Time: 6 min                                        
Bad Sectors: 0 sectors                                                          
Powered On: 1.2 days                                                            
Power Cycles: 54                                                                
Average Powered On Per Power Cycle: 33.3 min                                    
Temperature: 52.0 C                                                             
Overall Status: GOOD                                                            
ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good                                                               
  1 raw-read-error-rate         200   200    51   0           0x000000000000 prefail online  yes                                                                
  3 spin-up-time                208   186    21   2.6 s       0x170a00000000 prefail online  yes                                                                
  4 start-stop-count            100   100     0   54          0x360000000000 old-age online  n/a                                                                
  5 reallocated-sector-count    200   200   140   0 sectors   0x000000000000 prefail online  yes                                                                
  7 seek-error-rate             200   200    51   0           0x000000000000 prefail online  yes                                                                
  9 power-on-hours              100   100     0   1.2 days    0x1e0000000000 old-age online  n/a                                                                
 10 spin-retry-count            100   253    51   0           0x000000000000 prefail online  yes                                                                
 11 calibration-retry-count     100   253    51   0           0x000000000000 old-age online  yes
 12 power-cycle-count           100   100     0   54          0x360000000000 old-age online  n/a
190 airflow-temperature-celsius  48    37    45   52.0 C      0x340000000000 old-age online  no
194 temperature-celsius-2        95    84     0   52.0 C      0x340000000000 old-age online  n/a
196 reallocated-event-count     200   200     0   0           0x000000000000 old-age online  n/a
197 current-pending-sector      200   200     0   0 sectors   0x000000000000 old-age online  n/a
198 offline-uncorrectable       100   253     0   0 sectors   0x000000000000 old-age offline n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a
200 multi-zone-error-rate       100   253    51   0           0x000000000000 prefail offline yes
[root@gray ~]# rpm -q libatasmart
libatasmart-0.12-2.fc11.x86_64


Thanks,
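
The airflow-temperature-celsius row in the table above is marked FAIL even though its current value (46) is above the threshold (45), because its worst recorded value (37) is below it. The sketch below illustrates that distinction. It is an assumption about how such a check could be written, not the actual libatasmart code; the class, function and status strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    # Hypothetical container for one row of the attribute table.
    name: str
    current: int
    worst: int
    threshold: int


def status(attr: Attribute) -> str:
    """Classify a normalized SMART attribute against its threshold."""
    if attr.threshold == 0:
        return "n/a"                  # no threshold defined by the vendor
    if attr.current <= attr.threshold:
        return "failing now"
    if attr.worst <= attr.threshold:
        return "failed in the past"   # currently fine, but it crossed the threshold before
    return "good"


# Values copied from the devkit-disks output above.
rows = [
    Attribute("reallocated-sector-count", 200, 200, 140),
    Attribute("airflow-temperature-celsius", 46, 37, 45),
    Attribute("start-stop-count", 100, 100, 0),
]
for row in rows:
    print(f"{row.name}: {status(row)}")
```

Separating "failing now" from "failed in the past" would let a notifier treat a single past temperature excursion differently from a value that is currently at or below the threshold.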

Comment 11 Antonio A. Olivares 2009-05-07 12:15:26 UTC

I am submitting this as I might not have X anymore :(

other bug over here:

https://bugzilla.redhat.com/show_bug.cgi?id=499570

Comment 12 Steve 2009-05-14 13:17:22 UTC

Please change Version to F11.

Comment 13 Andre Robatino 2009-06-09 04:23:47 UTC

Every time I log into F11 (-rc4), I am given the same severe warning that my HDD is failing.  It turns out that the cause of the warning is 1 reallocated sector.  Saved smartctl output from 6 months ago shows it wasn't there then.  Still, it seems overkill to get the same loud warning whether I've had the same one bad sector for months, or if the number is increasing every day.  I think it makes more sense to give a single loud warning each time something changes, or if SMART indicates imminent failure (even now the overall test result is "PASSED").  The present warning is like crying wolf - it will eventually be ignored, so the user won't notice when the drive is really heading south.
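
The policy suggested here (warn once when something changes, or whenever SMART itself reports failure) could be sketched roughly as follows. This is only an illustration of the idea, not how gnome-disk-utility works; the state file, the function name and the JSON layout are all hypothetical.

```python
import json
from pathlib import Path


def should_warn(serial: str, reallocated: int, smart_passed: bool,
                state_file: Path) -> bool:
    """Return True only if the situation changed since the last check.

    Hypothetical policy: always warn when the SMART overall assessment
    fails; otherwise warn only when the reallocated sector count has
    grown since the value remembered for this drive.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    previous = state.get(serial)

    # Remember the latest count for the next login.
    state[serial] = reallocated
    state_file.write_text(json.dumps(state))

    if not smart_passed:
        return True
    if previous is None:
        return reallocated > 0    # first time this drive is seen
    return reallocated > previous
```

With this rule, a drive that has had the same single reallocated sector for months triggers one warning rather than one per login, while a growing count or a failed overall assessment still warns every time it is observed.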

Comment 14 Andre Robatino 2009-06-09 04:26:01 UTC

Created attachment 346963
Output of # smartctl -a /dev/sda

Comment 15 David Zeuthen 2009-06-09 13:25:00 UTC

(In reply to comment #13)
> Every time I log into F11 (-rc4), I am given the same severe warning that my
> HDD is failing.  It turns out that the cause of the warning is 1 reallocated
> sector.  Saved smartctl output from 6 months ago shows it wasn't there then. 
> Still, it seems overkill to get the same loud warning whether I've had the same
> one bad sector for months, or if the number is increasing every day.  I think
> it makes more sense to give a single loud warning each time something changes,
> or if SMART indicates imminent failure (even now the overall test result is
> "PASSED").  The present warning is like crying wolf - it will eventually be
> ignored, so the user won't notice when the drive is really heading south.  

Well, the old saying is that "you can't be a little pregnant". Either the disk is failing or it's not; it's not like disks magically get better over time. That said, there's already a bug about this upstream, see

 http://bugzilla.gnome.org/show_bug.cgi?id=579873

Comment 16 Andre Robatino 2009-06-09 13:38:38 UTC

Strictly speaking, all hardware is failing - whether to replace it is a question of what the expected time to failure is, and how likely it is that it will fail without adequate warning.  I have smartctl output from an old 8 GB HDD that I used for almost 10 years.  It shows 373 reallocated sectors.  I was never aware of the problem, since I only paid attention to SMART's assessment of "PASSED".  But chances are that they accumulated over years.  The drive was still working when I replaced it with a bigger one.

Comment 17 Bug Zapper 2009-06-09 14:44:06 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 18 Jeroen Beerstra 2009-06-21 14:31:31 UTC

This is a "me too".

The problem is that there really are 107 reallocated sectors on the Samsung 750 GB F1 disk in question. I can explain it like this: shortly after I first started using the disk there was a massive data loss, and I lost the complete dmraid0 partition on that disk and another identical disk; as it turned out there was no means of recovery. So in the end, after Samsung's ES-Tool indeed found bad sectors (CRC error IIRC), I did a complete low-level format of the drive and started over, and I have never had a single problem since. This could very well account for the reallocated sectors, but I'm not 100% sure about this, as I haven't gotten I/O errors related to this disk, not even when I did a complete dd copy to an img file on another, larger disk.

However, to make a long story short: I'm unable to perform a short SMART self-test. Could this be the real problem?

Jun 21 16:23:31 morphius kernel: ata4: EH in SWNCQ mode,QC:qc_active 0x3 sactive 0x3
Jun 21 16:23:31 morphius kernel: ata4: SWNCQ:qc_active 0x1 defer_bits 0x2 last_issue_tag 0x0
Jun 21 16:23:31 morphius kernel:  dhfis 0x1 dmafis 0x1 sdbfis 0x0
Jun 21 16:23:31 morphius kernel: ata4: ATA_REG 0x40 ERR_REG 0x0
Jun 21 16:23:31 morphius kernel: ata4: tag : dhfis dmafis sdbfis sacitve
Jun 21 16:23:31 morphius kernel: ata4: tag 0x0: 1 1 0 1  
Jun 21 16:23:31 morphius kernel: ata4.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Jun 21 16:23:31 morphius kernel: ata4.00: cmd 60/28:00:75:62:72/00:00:1c:00:00/40 tag 0 ncq 20480 in
Jun 21 16:23:31 morphius kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 21 16:23:31 morphius kernel: ata4.00: status: { DRDY }
Jun 21 16:23:31 morphius kernel: ata4.00: cmd 60/60:08:8d:c3:cc/00:00:26:00:00/40 tag 1 ncq 49152 in
Jun 21 16:23:31 morphius kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 21 16:23:31 morphius kernel: ata4.00: status: { DRDY }
Jun 21 16:23:31 morphius kernel: ata4: hard resetting link
Jun 21 16:23:32 morphius kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jun 21 16:23:32 morphius kernel: ata4.00: configured for UDMA/133
Jun 21 16:23:32 morphius kernel: ata4: EH complete
Jun 21 16:23:32 morphius kernel: sd 3:0:0:0: [sdc] 1465149168 512-byte hardware sectors: (750 GB/698 GiB)
Jun 21 16:23:32 morphius kernel: sd 3:0:0:0: [sdc] Write Protect is off
Jun 21 16:23:32 morphius kernel: sd 3:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I can provide more info if needed.

Comment 19 Jeroen Beerstra 2009-06-21 14:32:54 UTC

Created attachment 348792
Screenshot of palimpsest

Comment 20 Jeroen Beerstra 2009-06-21 14:40:33 UTC

Some additional info:

1 "#sktest /dev/sdb short" does seem to finish without errors:

# smartctl -a /dev/sdb
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJ1NQ208785
Firmware Version: 1AA01108
User Capacity:    750,156,374,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Sun Jun 21 16:34:24 2009 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (10890) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 182) minutes.
Conveyance self-test routine
recommended polling time: 	 (  20) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   078   078   011    Pre-fail  Always       -       7370
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       658
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10178
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       3467
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       647
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   069   068   000    Old_age   Always       -       31 (Lifetime Min/Max 31/31)
194 Temperature_Celsius     0x0022   069   066   000    Old_age   Always       -       31 (Lifetime Min/Max 30/33)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       14575253
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3467         -
# 2  Short offline       Aborted by host               00%      3377         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

2 Under Windows I can perform a SMART short selftest on sdc with smartctl without problems. Will verify this just to make sure.
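
To compare two drives such as sdb and sdc, the smartctl attribute table can be reduced to a dictionary keyed by attribute name. The following is a minimal sketch that assumes the column layout shown in the output above (ID#, name, flag, value, worst, threshold, type, updated, when-failed, raw value); the function name is hypothetical.

```python
def parse_attributes(text: str) -> dict:
    """Extract rows of the 'Vendor Specific SMART Attributes' table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # headers, self-test log rows and span rows do not match.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            # The raw column may contain extra text, e.g. temperature ranges.
            "raw": " ".join(fields[9:]),
        }
    return attrs


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   069   066   000    Old_age   Always       -       31 (Lifetime Min/Max 30/33)
# 1  Short offline       Completed without error       00%      3467         -
"""
table = parse_attributes(sample)
print(table["Reallocated_Sector_Ct"])
```

Running the same function over the output for each drive makes it easy to diff the normalized values and the raw reallocated sector counts side by side.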

Comment 21 Jeroen Beerstra 2009-06-21 15:31:07 UTC

smartctl -t short sdc under Windows XP did indeed not return any errors:

# smartctl -a /dev/sdc
smartctl version 5.38 [x86_64-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD753LJ
Serial Number:    S13UJ1KQ323989
Firmware Version: 1AA01109
User Capacity:    750,156,374,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Sun Jun 21 17:26:45 2009 CEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (11236) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 188) minutes.
Conveyance self-test routine
recommended polling time: 	 (  20) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   099   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   078   078   011    Pre-fail  Always       -       7460
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       660
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       107
  7 Seek_Error_Rate         0x000f   100   100   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10082
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       3467
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       1
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       647
 13 Read_Soft_Error_Rate    0x000e   100   099   000    Old_age   Always       -       0
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       496
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   070   069   000    Old_age   Always       -       30 (Lifetime Min/Max 30/31)
194 Temperature_Celsius     0x0022   069   068   000    Old_age   Always       -       31 (Lifetime Min/Max 30/33)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       21006000
196 Reallocated_Event_Count 0x0032   099   099   000    Old_age   Always       -       39
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      3466         -
# 2  Short offline       Aborted by host               20%      3466         -
# 3  Short offline       Aborted by host               00%      3376         -
# 4  Extended offline    Aborted by host               80%      3376         -
# 5  Short offline       Aborted by host               20%      3376         -
# 6  Short offline       Aborted by host               20%      3376         -

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Comment 22 Alex Butcher 2009-06-21 18:35:43 UTC

Created attachment 348808 [details]
Patch for libatasmart-0.12 to remove special handling for reallocated-sector-count attribute

Comment 23 Alex Butcher 2009-06-21 18:36:29 UTC

Created attachment 348809 [details]
Patch for gnome-disk-utility to remove special handling for reallocated-sector-count attribute

Comment 24 Alex Butcher 2009-06-21 18:38:21 UTC

I've attached a couple of trivial patches that stop libatasmart and gnome-disk-utility from crying wolf over >0 reallocated sectors. Hopefully, they don't have any unanticipated side-effects, but use at your own risk.

Comment 25 Jeroen Beerstra 2009-06-21 19:04:17 UTC

Isn't the real problem that I can't perform a SMART short selftest, or is this an unrelated problem?

Comment 26 Alex Butcher 2009-06-21 20:12:34 UTC

Jeroen, this BZ entry is about gdu-notification-daemon reporting that disks are failing for various reasons. In your case, it's because your drive has previously reallocated 107 sectors.

I guess some drives may not offer all SMART tests. To be honest, I'm not sure that the short test is of much value; I suspect it's just a basic function test on the mechanical components and the electronics, since it completes far too quickly to do much useful testing (compared with the long and selective tests). You might like to look into changing your SATA cables though; I had regular disconnects due to routine SMART probes that went away when I replaced my SATA cables with locking versions.

Comment 27 Jeroen Beerstra 2009-06-21 20:43:58 UTC

Thanks, the only problem I have with this drive at the moment is that SMART selftests don't complete under F11, that is with disconnect kernel messages and all. No disconnects or other I/O errors during normal operation which includes heavy I/O, and the same test completes just fine under Windows XP...

So I was wondering if this could be related to the problem mentioned in this BZ entry, that's  all.

The other thing worth mentioning is that it took some time before this "Your drive is failing" message started to appear. As a matter of fact I just replaced a truly faulty drive, so timing could not have been worse ;)

Comment 28 Andre Robatino 2009-06-22 01:30:47 UTC

It would be nice if Palimpsest remembered the previous values of attributes such as reallocated-sector-count and gave a one-time warning to each user if any of them has changed - but did not claim that the disk is failing unless SMART itself believes that.  That way I wouldn't have to check every single day whether my reallocated-sector-count has increased from 1.

Comment 29 Andre Robatino 2009-06-22 02:10:04 UTC

It would be even nicer if Palimpsest could remember the time of each previous increase in one of the dangerous attributes, so one could have a long-term record of the gradual degradation of the disk.  After doing a clean install of each new version of Fedora, one could restore the previous record from backups to preserve an unbroken record from when the disk was first installed.

Comment 30 Alex Butcher 2009-06-22 09:55:07 UTC

I like Andre's proposal. I think it would be quite acceptable for a newly-installed gdu/Palimpsest to give a flurry of alerts, but there should be an 'acknowledge' option for each. From then on, changes from the previously-acknowledged value would result in a warning alert (unless the user has asked to ignore changes of less than a given amount - e.g. for attributes like temperature, high fly writes, ECC recovered which can wander up and down), and a high-profile alert if the vendor threshold is crossed.
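
A minimal Python sketch of the acknowledge-and-compare policy proposed above. Everything in it is hypothetical (the function name `classify_alert`, the `tolerance` parameter, the four alert levels, and treating a threshold of 0 as "no threshold"); it illustrates the idea and is not existing gdu or libatasmart behaviour.

```python
from typing import Optional


def classify_alert(
    normalized: int,
    threshold: int,
    raw: int,
    acknowledged_raw: Optional[int],
    tolerance: int = 0,
) -> str:
    """Return 'critical', 'warning', 'unacknowledged' or 'ok' for one attribute."""
    # Vendor threshold crossed: always the high-profile alert.
    # Assumption: a threshold of 0 means the vendor defined no threshold.
    if threshold > 0 and normalized <= threshold:
        return "critical"
    # Nothing acknowledged yet: part of the initial flurry on a new install.
    if acknowledged_raw is None:
        return "unacknowledged" if raw > 0 else "ok"
    # The raw value moved by more than the user's tolerance since the last
    # acknowledgement (tolerance suits wandering values such as temperature).
    if abs(raw - acknowledged_raw) > tolerance:
        return "warning"
    return "ok"


print(classify_alert(100, 36, raw=1, acknowledged_raw=None))  # unacknowledged
print(classify_alert(100, 36, raw=1, acknowledged_raw=1))     # ok
print(classify_alert(100, 36, raw=3, acknowledged_raw=1))     # warning
print(classify_alert(30, 36, raw=900, acknowledged_raw=900))  # critical
```

The acknowledged value would have to be stored per disk and per attribute; where and how is left open here.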

Comment 31 Chris Halsall 2009-06-24 19:53:54 UTC

@David Zeuthen: "Have you considered that maybe your disk is really failing?"

As someone who has been using Live releases of F11 for a little while now (and have been running FC since FC existed...), I was a little taken aback by the hell I encountered installing F11 on my primary workstation...

I too encountered scary messages about my disk failing.

I too clicked on the Icon, as instructed.

I too received no information.

Until I finally clued in that after clicking the Icon, I then had to click on the little yellow bar below to give the details.  If I clicked on the Icon again (double click) the yellow bar disappeared...

Might I please suggest that when one double clicks on the Icon, something more informative appears than a simple yellow bar giving the HD's specifics?

Perhaps when one double clicks on the Icon, this should result in the same behaviour as that which results when one clicks on the Icon, and *then* clicks on the yellow bar which appears below?

Comment 32 Gene Czarcinski 2009-06-26 18:18:59 UTC

Count this as a vote to somehow change palimpsest so that it only warns of a failing disk if the bad sector count increases or if it exceeds some nominal small number.

These days, disk drive manufacturers sell drives with a "small" number of bad sectors simply because, to do otherwise, would result in much higher prices for disks.  If the number of remapped/relocated sectors is small, there should be no measurable performance hit.

Comment 33 Chris Halsall 2009-06-26 19:47:49 UTC

Thank you Czarcinski...

And to take this a bit further...

I still have a scary icon in my icon bar. It reminds me of those who run WinBlows, and haven't paid for Virus protection ("You might be at risk!!! Pay now!!!"). (Read: A hard-disk icon with an orange caution triangle around an exclamation symbol.)

If you click on this icon, you are presented with a light-yellow bar showing the hard-drive in question. If you click on that you are presented with the "Palimpsest Disk Utility" window. (If you instead click again on the Icon, the light-yellow bar disappears, and you're wondering why the double-click didn't work.)

If you then click on the "Details button" you are then presented with a very impressive window titled "ATA SMART Attributes", which includes a time-value domain section of the window which contains (surprise!) NO DATA!!!!

In my particular case, the spread-sheet like data below the graph contains in row #5 (note that there is no row #6 nor #8) "Attribute: Reallocated Sector Count", "Current: 192", "Worst: 192", "Threshold: 140", "Value: 24 Sectors", "Status: FAILING".

Please note that I have been watching this *very* closely since I installed F11 on my primary workstation.

Please note that I have not observed a single change in these values. (Nor have I observed anything I should be concerned about in my "/var/log/*".)

May I please suggest that if the Open Source Community wishes to gain users, it might be a good idea not to freak out those unfamiliar with it who try to use the OS/distribution.

I agree with Czarcinski above. It would be worthwhile for the new Install to bring to the attention of the User the OS' perspective of any risk; of where it finds itself. But it is *not* a good idea to freak out new Users if where the new OS finds itself is nominal.

Kindest regards to all.

Comment 34 Kevin Verma 2009-07-01 16:20:26 UTC

I face the same issue on a Dell Latitude D430 with Fedora 11 :-( 

And I doubt this disk is really failing; I think I would like to have this isolated first. So I am providing more information to you in attachments. 

# devkit-disks --show-info /dev/sda > /tmp/output-1-devkit-disks.txt
# skdump /dev/sda > /tmp/output-2-skdump-sda.txt
# rpm -q libatasmart
libatasmart-0.12-3.fc11.x86_64

Regards,
Kevin Verma

Comment 35 Kevin Verma 2009-07-01 16:22:27 UTC

Created attachment 350143 [details]
output-1-devkit-disks.txt

Comment 36 Kevin Verma 2009-07-01 16:23:32 UTC

Created attachment 350144 [details]
output-2-skdump-sda.txt

Comment 37 Alex Butcher 2009-07-01 17:13:02 UTC

Kevin -

Actually your skdump output indicates that your disc may well have some fairly serious problems (the current-pending-sector and offline-uncorrectable attributes).

Mind you, 26GiB of uncorrectable sectors and 274GiB of pending sectors sounds somewhat unlikely. Seems like your drive's SMART function has lost its mind.

I'd back up, then run a SMART long test and see what happens to those two counts. If there's no significant change, I'd get in touch with Samsung and see what they say; maybe there's a firmware upgrade available?

Comment 38 Chris Halsall 2009-07-01 18:52:47 UTC

OK.  Let me please present my data:

For sda (primary HD on a Dell M65):

ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good
  1 raw-read-error-rate         200   200    51   0           0x000000000000 prefail online  yes
  3 spin-up-time                190   188    21   1.5 s       0xd30500000000 prefail online  yes
  4 start-stop-count            100   100     0   195         0xc30000000000 old-age online  n/a
  5 reallocated-sector-count    197   197   140   24 sectors  0x180000000000 prefail online  no 
  7 seek-error-rate             200   200    51   0           0x000000000000 old-age online  yes
  9 power-on-hours               79    79     0   1.8 years   0xd13c00000000 old-age online  n/a
 10 spin-retry-count            100   100    51   0           0x000000000000 old-age online  yes
 11 calibration-retry-count     100   100    51   0           0x000000000000 old-age online  yes
 12 power-cycle-count           100   100     0   164         0xa40000000000 old-age online  n/a
192 power-off-retract-count     200   200     0   161         0xa10000000000 old-age online  n/a
193 load-cycle-count            115   115     0   257218      0xc2ec03000000 old-age online  n/a
194 temperature-celsius-2       104    96     0   43.0 C      0x2b0000000000 old-age online  n/a
196 reallocated-event-count     193   193     0   7           0x070000000000 old-age online  n/a
197 current-pending-sector      200   200     0   0 sectors   0x000000000000 old-age online  n/a
198 offline-uncorrectable       100   253     0   0 sectors   0x000000000000 old-age offline n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a
200 multi-zone-error-rate       200   200    51   0           0x000000000000 old-age offline yes

For sdb (a USB 500 gig drive purchased in 2008.12):

ID# Name                        Value Worst Thres Pretty      Raw            Type    Updates Good
  1 raw-read-error-rate         118    99     6   187107384   0x3808270b0000 prefail online  yes
  3 spin-up-time                 92    90     0   0 ms        0x000000000000 prefail online  n/a
  4 start-stop-count            100   100    20   147         0x930000000000 old-age online  yes
  5 reallocated-sector-count    100   100    36   1 sectors   0x010000000000 prefail online  no 
  7 seek-error-rate              65    60    30   3226135     0x173a31000000 prefail online  yes
  9 power-on-hours               96    96     0   5.8 months  0x671000000000 old-age online  n/a
 10 spin-retry-count            100   100    97   0           0x000000000000 prefail online  yes
 12 power-cycle-count           100   100    20   7           0x070000000000 old-age online  yes
184 attribute-184               100   100    99   n/a         0x000000000000 old-age online  yes
187 reported-uncorrect          100   100     0   0 sectors   0x000000000000 old-age online  n/a
188 attribute-188                96    96     0   n/a         0x080008000500 old-age online  n/a
189 high-fly-writes             100   100     0   0           0x000000000000 old-age online  n/a
190 airflow-temperature-celsius  57    42    45   43.0 C      0x2b001b396202 old-age online  no 
194 temperature-celsius-2        43    58     0   43.0 C      0x2b0000001200 old-age online  n/a
195 hardware-ecc-recovered       35    32     0   187107384   0x3808270b0000 old-age online  n/a
197 current-pending-sector      100   100     0   0 sectors   0x000000000000 old-age online  n/a
198 offline-uncorrectable       100   100     0   0 sectors   0x000000000000 old-age offline n/a
199 udma-crc-error-count        200   200     0   0           0x000000000000 old-age online  n/a

Comment 39 Alex Butcher 2009-07-01 20:15:27 UTC

Chris -

In both cases, I believe the drives are reported as failing because both libatasmart and the gdu-notification-daemon handle the reallocated-sector-count and current-pending-sector attributes as a special case; they report FAILING if the raw value is >0, rather than when the normalized value drops to $THRESHOLD or below, as they do for every other attribute.

I believe this is in error for reallocated-sector-count, but I can see why current-pending-sector > 0 raises an alert, since this requires some administrator action to investigate and/or force the drive to re-write (and maybe reallocate) the sector by writing to it.
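
The difference between the two rules can be sketched in Python using the reallocated-sector-count rows from comment 38. The function names are hypothetical, and the special case is written from the description above, not copied from the libatasmart or gdu sources.

```python
def vendor_rule_failing(normalized: int, threshold: int) -> bool:
    """smartctl-style rule: failing when the normalized value is <= threshold."""
    return normalized <= threshold


def raw_rule_failing(raw: int) -> bool:
    """Special case as described above: failing as soon as the raw count is > 0."""
    return raw > 0


# (drive, normalized value, threshold, raw count) for reallocated-sector-count,
# taken from the two skdump tables in comment 38.
rows = [
    ("sda", 197, 140, 24),
    ("sdb", 100, 36, 1),
]

for drive, normalized, threshold, raw in rows:
    print(drive, vendor_rule_failing(normalized, threshold), raw_rule_failing(raw))
```

Both drives pass the vendor rule and fail the raw-count rule, which matches the "Good: no" column in the tables.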

Comment 40 Chris Halsall 2009-07-01 21:11:43 UTC

Alex...

Thanks for the above...

But please agree with me that in my above two empirical reports the "current-pending-sector" value is zero.

As in, there's nothing actually wrong.

And yet, F11 (falsely) tells me (and the new user) that they're (we're) in serious danger of imminent disaster.  

Many will receive this confusing and worrying report.  (Sounds a bit like FOX News... (wink))

Hummm....

Comment 41 Chris Halsall 2009-07-01 21:26:16 UTC

And, actually Alex, having had a moment to think about it, I question why libatasmart and gdu-notification-daemon should take the liberty of interpreting the empirical values differently than the $THRESHOLD.

The drives are actually reporting that everything is nominal.

F11 (et al) are raising scary notices.

Could I perhaps suggest that instead F11's software tell the user something along the lines of "your drives are saying they're probably fine, but with some minor, nominal, and expected errors.  Please let us help you collect temporal data to determine if you might wish to consider replacing your drives, just to be safe.  Click here...."

Just putting that out there....

Comment 42 Jason Haar 2009-07-01 22:00:24 UTC

Here's a slightly different take on all this.

It does appear the SMART controller is returning obviously bogus information - e.g. my Samsung disk (notice the correlation...) is reporting that it's been up for 198 days when it's only been up for 2 hours. So this is a case of "garbage in, garbage out".

So then it becomes a design issue: do you want devkit to be going nuts and reporting "errors" on every SMART disk out there that is poorly implemented? I'd say it would make sense for devkit to sanity-check some of the data it's getting back, and if it's obviously broken, report to the user that their disk is running a broken SMART implementation and therefore devkit is *going to stop monitoring it*
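
One possible sanity check, sketched in Python: a sector count that would cover more than the whole disk cannot be right (compare the "26GiB of uncorrectable sectors" remark in comment 37). The function name, the 512-byte sector size and the 80 GB capacity in the example are assumptions made for illustration; nothing here is taken from devkit or libatasmart.

```python
SECTOR_SIZE = 512  # bytes; assumed logical sector size


def implausible_sector_count(sector_count: int, capacity_bytes: int) -> bool:
    """True when the reported sectors would cover more than the whole disk."""
    return sector_count * SECTOR_SIZE > capacity_bytes


# 274 GiB worth of 512-byte sectors on a hypothetical 80 GB drive.
pending = 274 * 2**30 // SECTOR_SIZE
print(implausible_sector_count(pending, 80 * 10**9))  # True

# 107 reallocated sectors on the 750,156,374,016-byte drive from comment 21.
print(implausible_sector_count(107, 750_156_374_016))  # False
```

A monitor could stop alerting on an attribute that fails such a check and tell the user why, instead of reporting the disk as failing.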

For myself, I "fixed" this problem by disabling smart - I just think devkit should figure that out for itself ;-)

Jason

Comment 43 Alex Butcher 2009-07-01 23:37:44 UTC

Jason -

As I said, the drives are being reported as failing because they both have reallocated-sectors-count > 0. My mention of the pending-sectors attribute was for completeness.

A few weeks back I emailed the author of libatasmart privately to ask why it does this, with no response, and there's been no response from the developer/packager of gdu here either. You'll see I've even attached patches to calm down libatasmart and gdu, but still nothing from the people who can actually fix it in F11.

Incidentally, some Windows SMART software takes the same liberties as libatasmart and gdu. Some people obviously feel that's correct behaviour. Maybe we're just banging our heads against a brick wall, in that case.

Comment 44 Chris Halsall 2009-07-01 23:59:27 UTC

Alex...

Perhaps a better question to ask is why is F11 presenting to its users (by packages written by unresponsive authors) inappropriately alarmist notices?

Just how much QA went into gdu?  libatasmart?

Jason...

Rather than simply disabling what *might* be valuable information, might it not make more sense for the software to know that it might be lied to by the hardware, and inform the user what it is able to empirically determine over time?

(Assuming, of course, that the software (and the hardware) actually has the users' best interests in mind...)

(Not always a safe assumption....)

Comment 45 Alex Butcher 2009-07-06 10:32:12 UTC

Thread on fedora-list about this issue: https://www.redhat.com/archives/fedora-list/2009-July/msg00357.html

Comment 46 Jim Cromie 2009-07-07 19:55:56 UTC

A config file would be nice.
It should allow me to set the failed-sectors threshold to 1 higher than the current value,
so that
1 - I don't have to dismiss the startup warning,
2 - I can lose the !status on the panel,
3 - I get warned when something changes.

A text report of the current status, in a format usable for config-file
preparation, would sweeten it.

I think this would cover all the "tell me of changes" complaints.
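
A rough Python sketch of such a text report, assuming an INI-style file with a made-up `[thresholds]` section; neither the file format nor the function name `baseline_config` exists in gdu or libatasmart. Each threshold is set to the current raw value plus one, so a warning would fire on the next change.

```python
import configparser
import io
from typing import Mapping


def baseline_config(current_raw: Mapping[str, int]) -> str:
    """Render a config that sets each alert threshold one above the current raw value."""
    config = configparser.ConfigParser()
    config["thresholds"] = {name: str(raw + 1) for name, raw in current_raw.items()}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Current raw values as they might be read from skdump output.
print(baseline_config({"reallocated-sector-count": 24, "current-pending-sector": 0}))
```

The output can be saved as is and edited by hand, for example to raise a threshold for an attribute that is allowed to drift.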

Comment 47 Oded Arbel 2009-07-13 10:23:59 UTC

The problem is in libatasmart's or gdu's understanding of what SMART attribute values mean.

According to man smartctl, an attribute should be considered failing only when its "current normalized value" is equal to or lower than its "threshold".

On a laptop I have palimpsest report an error "the disk has bad sectors". skdump has this to say (same values as from palimpsest's details view):
---8<----
# skdump /dev/sda
Device: /dev/sda
Type: 16 Byte SCSI ATA SAT Passthru
...
SMART Disk Health Good: yes
Off-line Data Collection Status: [Off-line data collection activity was never started.]
...
Bad Sectors: 1 sectors
Powered On: 4.1 days
Power Cycles: 76
Average Powered On Per Power Cycle: 1.3 h
Temperature: 39.0 C
Overall Status: BAD_SECTOR
ID# Name Value Worst Thres Pretty Raw Type Updates Good
...
5 reallocated-sector-count 100 100 36 1 sectors 0x010000000000 prefail online no
...
190 airflow-temperature-celsius 61 37 45 39.0 C 0x270024270800 old-age online no
...
---8<----

As can be seen, skdump reports correct values for all attributes (compared to smartctl -a /dev/sda report which I will attach shortly), but understands them incorrectly:
1. reallocated-sector-count reports 1 sector reallocated, yet its normalized value (100) is well above the error threshold (36).
2. airflow-temperature-celsius is currently OK but it has indeed failed in the past. Still, skdump should not report it as "not good", as its current value is well above the threshold. This line, on the other hand, does indicate that skdump knows about "threshold", "current normalized value" and "worst", as it interpreted "worst" being less than "threshold" as a problem.
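
The distinction between "failing now" and "failed in the past" can be written down in a few lines of Python. This is a sketch of the threshold comparison described in man smartctl, with a hypothetical function name; it is not how skdump is implemented.

```python
def when_failed(value: int, worst: int, threshold: int) -> str:
    """Classify one attribute from its normalized VALUE, WORST and THRESH."""
    if value <= threshold:
        return "failing now"
    if worst <= threshold:
        return "failed in the past"
    return "ok"


# Values from the skdump excerpt above.
print(when_failed(100, 100, 36))  # reallocated-sector-count: ok
print(when_failed(61, 37, 45))    # airflow-temperature-celsius: failed in the past
```

Under this rule neither attribute is failing at the moment, and only the temperature attribute has a past failure on record.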

I don't care much about libatasmart's skdump, as it is a command-line utility and not highly visible to non-technical users, but gnome-disk-utility's UI (palimpsest or however you want to call it) should not report errors in such a visible way (scary tray icon and "in your face" red notification bubble) for attributes that are well above the vendor's threshold.

Comment 48 Oded Arbel 2009-07-13 10:30:18 UTC

Created attachment 351458 [details]
smartctl report for my "failing" drive

This report shows the full SMART data about the drive that gdu marks as "failing".

This is from man smartctl:
---8<---
Each Attribute has a "Raw" value, printed under the heading
"RAW_VALUE", and a "Normalized" value printed under the heading
"VALUE".  ... Each vendor uses their own algorithm to convert this "Raw"
value to a "Normalized" value in the range from 1 to 254. ...
...
Each Attribute also has a Threshold value (whose range is 0 to 255)
which is printed under the heading "THRESH".  If the Normalized value
is **less than or equal to** the Threshold value, then the Attribute
is said to have failed.  If the Attribute is a pre-failure Attribute,
then disk failure is imminent.
---8<---

As can be seen, a raw value of 1 for reallocated sectors, as reported by SMART, corresponds to a normalized value of 100, which is above the threshold of 36 for my drive. This is not a failure.

Comment 49 Chance Brohm 2009-08-09 00:06:44 UTC

I'd like to add a "me too" - I have dropped gdu-notification-daemon from my startup list.  I'm not comfortable with the thought that I have to monitor disks manually/periodically for signs of failure, but it's better than that orange triangle screaming at me from the notification area without justification.

I'd also like to join Oded in pointing out that SMART already has a scheme in place to determine whether a statistic/attribute indicates imminent failure by comparison of the normalized (current) value against the associated threshold (limit). I wish that the "sense" of those values/comparison were reversed (i.e., initial values start low and predict failure when they EXCEED the threshold) but that is not the case. I think Oded's explanation is clear and accurate - no alert should be triggered unless the normalized value(s) fall to or below the threshold.

Aside from the unjustified alarm, I really like g-d-u and the palimpsest UI a LOT.

Comment 50 Javier Alejandro Castro 2009-08-22 03:40:54 UTC

I think I have the same problem as some guys in the previous comments. My 1TB HD is shiny new (1 month or less since I installed it). I don't think it is failing... because... before this disk I had a 250GB one which... F11 said was failing too!

That disk died as F11 said, but I don't really BELIEVE I have the luck of my new disk already failing!!!

There must be a misinterpretation of the attribute values or something like that.

Please people, correct that urgently!

I attach the reports of my 1TB Barracuda HD

Comment 51 Javier Alejandro Castro 2009-08-22 03:43:20 UTC

Created attachment 358294 [details]
skdump of my "failing" disk

Comment 52 Javier Alejandro Castro 2009-08-22 03:44:18 UTC

Created attachment 358295 [details]
devkit-disks dump of my "failing" disk

Comment 53 Oded Arbel 2009-08-23 12:34:49 UTC

You have the same problem as described previously - your new disk has 2 reallocated sectors, which is not really a problem and nothing to write home about: hard drives come with plenty of spare sectors, as the manufacturer can't possibly test every sector of every hard drive they ship and sometimes there are very small failures in the manufacturing process. So the hard drive can detect such failures as it chugs along and handle them silently. Well, not very silently - it does report them so you can see if there are too many errors that indicate an imminent disk failure.

2 bad sectors that were handled by the drive's firmware shouldn't trigger an alarm, and the drive itself reports that it's no cause for concern - the "current value" for "reallocated-sector-count" is way above the error threshold for the drive. The g-d-u devs suggest fixing this by still not listening to the drive's status report but instead using the raw data with a "better" heuristic based on disk size (so 2 reallocated sectors will still be considered a failure if your disk is small enough, but it'll be OK for 1TB drives).

I'm not sure when this change is planned to hit Fedora 11 if at all.

Comment 54 Javier Alejandro Castro 2009-08-29 14:33:58 UTC

I've installed F12 alpha in the hope that this was already fixed, but no luck. It is still saying my disk is failing.

Comment 55 Oded Arbel 2009-08-31 06:57:36 UTC

Fedora 12 should have libatasmart 0.14, which according to Bug #506254 should behave. If this is not the case, please comment on bug #506254.

Comment 56 Alex Butcher 2009-08-31 08:51:45 UTC

(In reply to comment #55)
> Fedora 12 should have libatasmart 0.14, which according to Bug #506254 should
> behave. If this is not the case, please comment on bug #506254.  

g-d-u implements its own non-standard logic for the interpretation of reallocated sector and pending sector counts, and so will need to be patched similarly.

See my patches attached to this bug report.

Comment 57 Javier Alejandro Castro 2009-11-07 19:45:12 UTC

F12 Beta does not show any notification now. Hope this is better? Anyway, from time to time, I run gnome-disk-utility to check that the reallocated sector count is not increasing.

Comment 58 Jason Haar 2009-11-07 20:59:26 UTC

Created attachment 367972 [details]
devkit-disks dump

Comment 59 Jason Haar 2009-11-07 21:01:57 UTC

I've installed (ie not upgraded) my affected F11 system to F12Beta2 (actually
looks like full Constantine now) and the first time I logged in I got the same
error saying my disk was failing.

Attached is 'devkit-disks --show-info /dev/sda'

Jason

Comment 60 Andre Robatino 2009-11-14 10:36:29 UTC

I have one reallocated sector, and after clean installing F12 (AKA RC4), I don't get the screaming warning anymore.  It's still possible to run palimpsest manually and see the warning there.  Anyone know if there's any provision for a one-time warning if something changes?  Otherwise I have to check regularly for my own protection (a steadily increasing number of bad sectors would be a bad sign even if it hasn't hit the threshold yet).

Comment 61 Stuart D Gathman 2010-01-08 02:51:12 UTC

Created attachment 382378 [details]
Output of smartctl -a /dev/sda

The Reallocated_Sector_Ct raw value of 0x40000a is taken literally in reverse byte order as 655424.

Comment 62 Stuart D Gathman 2010-01-08 02:53:11 UTC

I have the same problem on Fedora 12.  There are 10 reallocated sectors, and the raw value is 0x40000a, which smartctl and palimpsest interpret as 655424.  I think the 0x400000 is a flag of some sort.
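
The arithmetic can be checked with a few lines of Python. The sketch assumes the six raw bytes are 40 00 0a 00 00 00 in storage order (only the first three are given above), and the split into 16-bit words is shown purely as an alternative reading; what the drive vendor actually encodes in each byte is not established in this report.

```python
import struct

# Raw field in storage order.
# Assumption: 40 00 0a followed by three zero bytes.
raw_bytes = bytes.fromhex("40000a000000")

# Reading all six bytes as one little-endian integer gives the reported value.
as_little_endian = int.from_bytes(raw_bytes, "little")
print(as_little_endian)  # 655424

# Splitting the same bytes into three 16-bit little-endian words instead.
words = struct.unpack("<3H", raw_bytes)
print(words)  # (64, 10, 0)

# Cross-check against load-cycle-count from comment 38 (raw 0xc2ec03000000).
print(int.from_bytes(bytes.fromhex("c2ec03000000"), "little"))  # 257218
```

The cross-check shows that the skdump "Pretty" column in comment 38 is the little-endian reading of the raw bytes, which is consistent with 655424 being a literal interpretation of a field that this drive may pack differently.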

Comment 63 Stuart D Gathman 2010-01-15 19:23:15 UTC

It would be helpful if the GUI let you disable alerts for a specific SMART reading.  Currently, you can only disable all alerts for a disk.

Comment 64 Bug Zapper 2010-04-27 13:59:10 UTC

This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 65 Stuart D Gathman 2010-04-28 01:32:57 UTC

Does the process mean that we have to submit a new bug for F12 (where the problem still exists)?

Comment 66 Adam Williamson 2010-04-29 15:35:15 UTC

No, not at all - if you say the problem still exists in F12, we can just adjust this report to be against F12.



-- 
Fedora Bugzappers volunteer triage team
https://fedoraproject.org/wiki/BugZappers

Comment 67 Stuart D Gathman 2010-04-29 19:21:42 UTC

What is still broken in F12 is misparsing certain values on certain disks, see Comments 61-63.  There will probably *always* be corner cases like that, so the feature suggested in Comment 63 would let the utility remain useful.  We just need to be able to ignore specific smart values that are getting misinterpreted.  Currently, the only option is to ignore everything.  

If you want to try to protect the user from themselves, you could offer resistance ("are you sure") when ignoring a value that is currently Green (since presumably the user would want to be notified when that status changes later).

Comment 68 Adrian Dinita 2010-05-24 14:42:01 UTC

Created attachment 416152 [details]
skdump for my failing drive test Model: [Hitachi HTS541612J9SA00]

Comment 69 gert 2010-05-28 11:29:20 UTC

(In reply to comment #68)
> Created an attachment (id=416152) [details]
> skdump for my failing drive test Model: [Hitachi HTS541612J9SA00]    

I have a similar Hitachi model, and it gave the infamous Palimpsest message "DISK MAY BE FAILING" starting with the second boot after installing F12. Extensive testing with the Hitachi test software shows that there is nothing wrong with the hard disk.

I switched off the Palimpsest notification since it is obvious to me that it is not reliable. I have installed F12 on three different computers (one notebook with the aforementioned Hitachi HD, and two desktops, one with a WD HD and the other with a Maxtor). The WD machine did not get the Palimpsest notification, yet the other two both did. As mentioned, after running the manufacturers' extensive tests the drives showed no flaws.

Sorry to say, but Palimpsest is unreliable. More so since S.M.A.R.T. does not seem to produce any failure message in the described cases. It seems that certain HD models trigger the message through standard operations while no real problem has actually occurred.

So, what to do?

Well, common sense. Have a reliable backup strategy, for instance a file server with a RAID configuration and (daily?) full or full/sequential backups.

It's not as nasty as it may sound.

Alternatively, get yourself a NAS for backup that performs a very frequent (at least daily) backup of changed and new files. Those are available for as little as 150 euro / ?100USD? for personal use.

But silence the Palimpsest tool: click on its icon in the panel, click on the message, click on "more information" and finally check "Don't warn me..."

(I'm still puzzled where palimpsest gets its reallocated sector count from, which makes totally no sense.)

Comment 70 gert 2010-05-28 11:36:13 UTC

minor correction: 150 euro obviously is not 100USD. yet prices in US are lower for some 'stupid' reason (envy). EUR 150.- = approx. 180-225USD with present fluctuation in mind. anyhow, i hope you get the picture.

Comment 72 Adam Williamson 2010-05-28 14:59:43 UTC

gert: your smartctl output would be much more valuable, in this case, than a lecture.




Comment 73 gert 2010-05-29 11:42:27 UTC

(In reply to comment #72)
> gert: your smartctl output would be much more valuable, in this case, than a
> lecture.

Adam, hi. I agree, but I don't have any output available anymore and I'm not sure I can reproduce it. The computers are not available to me at the moment.
I came across these posts and felt like sharing my experience and point of view on the issue.

Software has bugs. We have to deal with that. There are always alternatives. Plus one should have a strategy for data integrity. That's all I'm saying.
I didn't mean to come off as a s.m.a.r.t. .ss and I blame it on our cultural differences if I did. In that case: sorry for that.

Just hope that my post has helped some people in one way or the other.
Since this particular message is off-topic I leave it to the admin to remove it.
Hopefully after it has been here a couple of days.

Comment 74 Stuart D Gathman 2010-05-31 22:42:00 UTC

If Palimpsest is not working for your disk(s), then smartd could do the job.  It can email you when something changes.  You can tell it to ignore specific parameters that are changing too often (like temp) or are misinterpreted.  Requires editing a text config file (not a gui).

Palimpsest works for all but one of the 10 disks on desktops here.  The reallocated sectors seem to be a problem on some disks.  See comments 61-63.

Comment 75 Virtual Spam Name 2010-07-12 13:05:04 UTC

Bug: nonexistent "Bad Sectors" on WD400UA "detected"
Latest Fedora release, 13, on Intel P4 arch, laptop.
I used manufacturer's hard drive utilities, at least three of them, from various manufacturers, and they all reported healthy hard drive.

Comment 76 Alex Butcher 2010-07-12 13:18:09 UTC

Upstream issue in libatasmart. See http://bugs.freedesktop.org/show_bug.cgi?id=25772

Comment 77 Alex Butcher 2010-07-12 13:41:13 UTC

(In reply to comment #69)
> (I'm still puzzled where palimpsest gets its reallocated sector count from,
> which makes totally no sense.)    

Palimpsest gets its data from libatasmart, which in turn uses the S.M.A.R.T. RAW_VALUEs. Run smartctl -a /dev/sdX on the drive, and you'll see the same count in the RAW_VALUE column. Neither Palimpsest nor libatasmart are buggy in this respect, but I maintain that they are buggy in respect of causing such a "flap" about it when the cooked VALUE hasn't fallen below the manufacturer-set THRESHold.

(In reply to comment #75)

The bad sectors on your drive almost certainly do exist, but the manufacturer tools don't use the raw number of failed sectors in the same way libatasmart/Palimpsest does. Instead, the firmware of each manufacturer's drives performs a proprietary calculation on the raw statistics and compares the result with a failure threshold. Only if the cooked attribute is less than the threshold for that attribute will the manufacturer tools report that the drive is faulty and qualifies for RMA.
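
The threshold rule described above can be sketched in a few lines. This is only an illustration, not code from libatasmart or from any vendor tool: the function name `attribute_failed` is hypothetical. The comment above says "less than the threshold"; the smartmontools documentation describes the test as "less than or equal to", and the sketch assumes the latter. The sample numbers are the reallocated-sector-count figures reported in comment 78 (value 97, threshold 36, raw 128 sectors).

```python
def attribute_failed(value: int, threshold: int) -> bool:
    """Return True when the cooked (normalized) value has reached the threshold.

    Assumption: the attribute counts as failed when value <= threshold,
    as the smartmontools documentation describes it.
    """
    return value <= threshold


# Figures from comment 78: cooked value 97, threshold 36, raw count 128.
# The raw count plays no part in this test at all.
print(attribute_failed(97, 36))  # False
```

Under this rule the drive in comment 78 is nowhere near failing, whatever the raw count of 128 sectors suggests.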

Comment 78 darko 2010-09-18 11:27:43 UTC

Hello, I am chiming in to report that I also have this issue.

I have 2 Seagate ST3100 1TB disks. 
I have a fake Nvidia raid 1 mirror setup.
Fedora version: 2.6.34.6-54.fc13.x86_64 #1 SMP Sun Sep 5 17:16:27 UTC 2010 x86_64 x86_64 x86_64 

The gnome disk utility shows a failing disk based upon values below:
reallocated-sector-count     97| 97| 36   good    128 sectors Pre-fail Online

Seagate seems to only provide tools and utilities for Windows. 

I removed both drives. 
I updated firmware on both drives. 
I ran the manufacturer's (Seagate) test utility on both drives. 
Both drives passed the Seagate utility's tests. 
I put both drives back onto my Linux system and booted up. 
Gnome disk utility is still showing "One or more disks are failing" and "DISK HAS MANY BAD SECTORS" warnings.

Comment 79 darko 2010-09-18 11:34:16 UTC

Created attachment 448187 [details]
devkit-disks report for /dev/sdb

as per my posted comment.

Comment 80 darko 2010-09-18 11:38:28 UTC

Created attachment 448188 [details]
updated revised output includes skdump and libatasmart version.

Comment 81 Stuart D Gathman 2010-09-30 20:48:03 UTC

Still broken in F14 beta.  Actual reallocated count is 12, disk utils reads this as 786514.  It is including flags bytes at the beginning of the raw value of 0x0c0052.  Disk model ATA Hitachi HTS541612J9SA00 firmware version SBDOC7PB
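
The arithmetic behind the misreading can be checked directly. The sketch below assumes, based only on the figures reported above, that this firmware keeps the real count in the bits above the low 16 and puts something else in the low 16 bits. The helper name `split_raw` is hypothetical, and this is not how libatasmart itself parses the attribute.

```python
RAW = 0x0C0052  # raw value quoted above for the Hitachi HTS541612J9SA00


def split_raw(raw: int) -> tuple:
    """Split a raw attribute value into (low 16 bits, remaining high bits)."""
    return raw & 0xFFFF, raw >> 16


low, high = split_raw(RAW)
print(RAW)       # 786514, the number the disk utility shows
print(high)      # 12, the actual reallocated count
print(hex(low))  # 0x52, the extra bytes that inflate the number
```

Reading all of the bytes as one integer gives 12 * 65536 + 0x52 = 786514, which matches the value the disk utility displays.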

Comment 82 Adam Williamson 2010-10-01 16:48:37 UTC

davidz, this has been open a long time, are you ever planning to fix it? It seems like a fairly significant issue.

Comment 83 David Zeuthen 2010-10-01 17:00:06 UTC

(In reply to comment #82)
> davidz, this has been open a long time, are you ever planning to fix it? It
> seems like a fairly significant issue.

Is someone ever planning to fix the lack of bug triaging in Fedora? It seems like a fairly significant issue </tongue_in_cheek>. Sorry, but bugs won't automagically get fixed if they are assigned to the wrong component. Reassigning.

Comment 84 Adam Williamson 2010-10-01 19:46:31 UTC

triaging requires expert knowledge =) i did wonder about that (though I'd have guessed udisks would be the culprit and hence still your fault :>), but wasn't sure so I didn't want to touch it. I assumed you would've changed it by now if it was wrong.

Comment 85 David Zeuthen 2010-10-01 20:33:53 UTC

(In reply to comment #84)
> triaging requires expert knowledge =) i did wonder about that (though I'd have
> guessed udisks would be the culprit and hence still your fault :>), but wasn't
> sure so I didn't want to touch it. I assumed you would've changed it by now if
> it was wrong.

Yeah - I should have reassigned it earlier but only saw it now and ENOTIME for bugs in the past. And, yeah, the stack is a bit more complicated than I like but, meh, not sure what to do about this except for pointing people to http://www.freedesktop.org/wiki/Software/udisks when filing bugs related to storage.

Comment 86 Chris Halsall 2010-10-01 21:22:04 UTC

(In reply to comment #81)
> Still broken in F14 beta.  Actual reallocated count is 12, disk utils reads
> this as 786514.  It is including flags bytes at the beginning of the raw value
> of 0x0c0052.  Disk model ATA Hitachi HTS541612J9SA00 firmware version SBDOC7PB

And let us please talk about the user's experience...

Under a great deal of uncertainty, a user finally installs Linux.  Fedora, for example.

Immediately, they're told their hard-drive "is failing", even though it isn't.

"Oh my god!", most think.  "This Linux thing is causing harm to my hardware."

Meanwhile, those of us who actually know what is going on realize that the software is lying to us, and that the hardware is (most likely) fine...

I, personally, find it *very* interesting that it has taken so long for the "Free/Open Source Software Community" to deal with this subsystem which so deterministically and regularly produces false "fear, uncertainty and doubt (FUD)" reports to those trying to use FOSS for the first time.

Hmmmmmmm....

Comment 87 darko 2010-10-01 23:43:18 UTC

"Oh my god!", most think.  "This Linux thing is causing harm to my hardware." << a little too dramatic. james woods would be proud. 

i think, at most, it's causing some confusion. in my case, i wasn't sure if the gnome disk utility in fedora was catching errors and h/w problems which the actual vendors' disk utilities were not catching. 

anyways, perhaps, until this gets fixed, we can suggest to add it or at least make some mention of it in the release notes of Fedora.

Comment 88 Adam Williamson 2010-10-02 06:44:05 UTC

chris: oh, please can the amateur theatrics. it's not a bug in g-d-u, so davidz wasn't fixing it. it wasn't assigned to lennart, so he didn't know about it. there's no conspiracy. whatever the hell you mean by 'free/open source software community' i've no idea. all i know are people who maintain packages.

Comment 89 Chris Halsall 2010-10-02 22:21:56 UTC

(In reply to comment #88)
> chris: oh, please can the amateur theatrics. it's not a bug in g-d-u, so davidz
> wasn't fixing it. it wasn't assigned to lennart, so he didn't know about it.
> there's no conspiracy. whatever the hell you mean by 'free/open source software
> community' i've no idea. all i know are people who maintain packages.

Sorry.  But I believe this is important.

Several people have been reporting this bug for over 17 months, and only now do we figure out that it's been assigned to the "wrong person".  WTF?

And let me please tell you, I have had two individuals to whom I've recommended Fedora who were seriously scared by this constant message of a hard-drive failing when WinBlows didn't.  They thought that Linux was damaging their hardware.

Theatrics aside, the "user experience" is *very* important (unless, of course, we only expect geeks and nerds to use Linux)....

Comment 90 Adam Williamson 2010-10-02 22:59:41 UTC

this is a bug tracking system, not a soapbox. also, this is a volunteer community. you don't achieve anything useful by grandstanding in bug reports, you have to *do* something concrete and productive. people willing to proclaim in grand terms that This Is Important are ten a penny. we're not short on that. what helps a lot more is to actually do stuff. the Bugzappers team needs help (lord, does it ever need help). there's lots of teams you can join to provide some kind of positive impact on 'the user experience'. lecturing on random bugzilla pages isn't going to do much good.

Comment 91 Chris Halsall 2010-10-02 23:31:53 UTC

(In reply to comment #90)
> ... you don't achieve anything useful by grandstanding in bug reports,
> you have to *do* something concrete and productive.

Terribly sorry for wasting your time.

I obviously mistakenly thought providing bug reports was part of the developmental process.

> what helps a lot more is to actually do stuff. the Bugzappers team needs help
> (lord, does it ever need help). there's lots of teams you can join to provide
> some kind of positive impact on 'the user experience'.

Care to provide a URL?  Rather than assuming I already know it.

I'm happy to "do something", so I will know what I "should be doing" rather than what I am currently doing....

Comment 92 darko 2010-10-03 01:52:35 UTC

quick question, is this a fedora/red hat bug or is this an upstream gnome bug? 

as stated earlier, just put a blurb about it in the docs or release notes or something until it gets fixed/patched.

Comment 93 Alex Butcher 2010-10-03 09:32:00 UTC

(In reply to comment #90)
> this is a bug tracking system, not a soapbox. also, this is a volunteer
> community. you don't achieve anything useful by grandstanding in bug reports,
> you have to *do* something concrete and productive.

Such as voluntarily providing patches, you mean?

Like the ones I provided 14 months ago and which got ignored?

Like explaining to the author of libatasmart on its bugtracker 3 months ago why saying the "disc is dying" with only about 20-30 reallocated sectors is not useful behaviour? ( http://bugs.freedesktop.org/show_bug.cgi?id=25772#c10 ) only to be ignored there also?

Debate and criticism are fine, but being disregarded loses community support from those of us who want to be constructive and help.

Comment 94 Adam Williamson 2010-10-03 17:12:16 UTC

exactly like that. I wasn't aiming my comment at you.

Comment 95 Stuart D Gathman 2010-10-04 00:15:39 UTC

The feature g-d-u could use is a way to ignore individual smart values - as suggested above.  It could very well turn out that the disk firmware is buggy (and shouldn't have that flag byte in the raw value).  Whether firmware or drivers, there is always the possibility of a glitch, and it would be nice to disable only the failing smart value, and not all of monitoring.  (Although I just enabled smartd, which *does* support ignoring specific values.)

Comment 96 sawrub 2010-10-17 08:49:30 UTC

Palimpsest Disk Utility says that the disk has many bad sectors, but running the SeaTools for DOS by Seagate [http://tinyurl.com/yk8p3r] shows no such results.
Please help me in finding the issue, whether it's with the disk or with Palimpsest Disk Utility.
Attached is the screen-shot of the utility.

Comment 97 sawrub 2010-10-17 08:50:39 UTC

Created attachment 453898 [details]
Message show by GDU

Comment 98 sawrub 2010-10-17 09:00:12 UTC

All of the data as requested in comment 1 are also here if that helps.

 - devkit-disks --show-info /dev/sda [Attached as devkit-disks]
 - skdump /dev/sda (as root) [Attached as skdump]
 - rpm -q libatasmart 

[sawrub@sawrub ~]$ rpm -q libatasmart
libatasmart-0.17-2.fc12.x86_64

Comment 99 sawrub 2010-10-17 09:01:01 UTC

Created attachment 453899 [details]
devkit-disks

Comment 100 sawrub 2010-10-17 09:01:26 UTC

Created attachment 453900 [details]
skdump

Comment 101 Alex Butcher 2010-10-17 09:13:09 UTC

(In reply to comment #96)
> Palimpsest Disk Utility says that the disk is having many bad sectors, running
> the SeaTools for DOS by Seagate [http://tinyurl.com/yk8p3r] shows no such
> results.
> Please help me in finding the issue, whether its with the Disk or with
> Palimpsest Disk Utility.
> Attached is the screen-shot of the utility.

It looks like your disc had (past tense) a problem; the current-pending-sector attribute describes sectors that fail to read correctly. If these sectors are written to, then read back OK, that's the end of the matter (i.e. it was a "soft error", perhaps caused by a failed partial write interrupted by a power failure). If the sector fails to read back OK, then that sector should be reallocated. Your disc is reporting no reallocated sectors, so it was probably a soft error.

Comment 102 sawrub 2010-10-17 10:05:49 UTC

(In reply to comment #101)

> It looks like your disc had (past tense) a problem; the current-pending-sector
> attribute describes sectors that fail to read correctly. If these sectors are
> written to, then read back OK, that's the end of the matter (i.e. it was a
> "soft error", perhaps caused by a failed partial write interrupted by a power
> failure). If the sector fails to read back OK, then that sector should be
> reallocated. Your disc is reporting no reallocated sectors, so it was probably
> a soft error.

That sounds good, but the utility has been showing the alert on every boot for around a month now. Also, the BIOS failed to detect any HD yesterday, just a couple of hours after a proper shutdown, which forced me to go out for a new HD to save my data in time.
Any opinion on this?

Comment 103 Alex Butcher 2010-11-29 15:54:02 UTC

Created attachment 463518 [details]
Adapted from Ubuntu libatasmart package

This patch is based on what Ubuntu are shipping to stop libatasmart (and, in turn, palimpsest) from using its own (faulty) heuristic to warn of discs with "many bad sectors". I believe I've left in the zero tolerance for pending sectors, as IMHO it's best to address those ASAP (i.e. by finding them and rewriting them).

Comment 104 tz 2010-12-10 15:50:31 UTC

I have the same problem - a new SSD that has a number of reallocated sectors, and I have gotten the notification and the scary icon on every boot.  If the number of sectors were INCREASING I would like to know, but it might be at the current value for the next two years, and I don't want to keep getting notified, so I hope the patch gets through (is there a way to get it before it hits the repos?).

Comment 105 infertux 2011-01-05 21:01:40 UTC

Same problem here.
Since I bought a new SSD disk, I get a warning about too many bad sectors at each boot.
But the disk has had 128 bad sectors since the first day I used it, and the count has not increased in six months. According to what I have read on the Internet, SSDs can have many more bad sectors than magnetic HDDs, and 128 is a very normal value.

Comment 109 Stuart D Gathman 2011-12-22 16:29:47 UTC

Please update version for Fedora 16!

Still the same problem in Fedora 16.  It is still showing the wrong bad sector count as in comment#81.

Comment 110 Adam Williamson 2012-01-03 23:27:45 UTC

Updated, thanks for the notification.




Comment 111 Stuart D Gathman 2012-01-30 17:13:45 UTC

Created attachment 558412 [details]
Screenshot of palimpsest with incorrect bad sector count

I still think there is a problem with palimpsest.  Smartctl -A correctly reports:

196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       15

g-d-u still reports 983137 (again misinterpreting the flag bytes).  

gnome-disk-utility-3.0.2-3.fc16.i686
smartmontools-5.42-1.fc16.i686
libatasmart-0.17-3.fc15.i686

Comment 112 Stuart D Gathman 2012-01-30 17:26:15 UTC

Chris:  when I install Fedora Gnome for people, I check whether g-d-u has this bug with the current disk status, and disable notifying of a failing disk if it does.  I then configure smartd, which reports bad sectors correctly, to send email.

Comment 113 Chris Halsall 2012-01-30 19:00:42 UTC

@Stuart...

Do you not find it a bit strange that after so many bug reports those responsible have not corrected this bug?

It's great that your clients have you to help them, but what about those who are just trying to install Linux for the first time on their own, and are told their hard-disks are failing when they boot their new "experimental" Linux environment?

Who profits from that?

Comment 115 Alex Butcher 2012-01-30 20:08:18 UTC

FYI, libatasmart upstream has recently been updated ( https://bugs.freedesktop.org/show_bug.cgi?id=25772#c11 ) so it now triggers for discs with more than log2(number of 512 byte sectors)*1024 reallocated sectors rather than log2(number of 512 byte sectors). This should make false positives (i.e. a healthy disc reported as failing) rather less likely, but may well now mask true positives (i.e. actually failing discs) instead. The 1024 scaling appears to be an entirely arbitrary fudge factor.

I'm afraid I completely fail to understand why the respective authors of these tools believe they understand disc failure behaviour better than the manufacturers of those discs.

Comment 116 Stuart D Gathman 2012-01-30 20:51:48 UTC

Heh.  The factor of 1024 will *still* report 900000 bad sectors (really 15) on a 120G drive as "failing".  I suppose I should get an account on bugs.freedesktop.org.
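
The point can be checked with a rough calculation. The sketch below is not the libatasmart source: it assumes the threshold is an integer (floor) log2 of the number of 512-byte sectors, multiplied by the scale factor, and the function name `sector_threshold` is hypothetical. The drive size is taken as a nominal 120 * 10^9 bytes, and 983137 is the misread count that g-d-u reports for this disk in comment 111.

```python
SECTOR_SIZE = 512  # bytes per sector, as in the formula quoted above


def sector_threshold(size_bytes: int, scale: int = 1024) -> int:
    """Reallocated-sector limit: floor(log2(number of sectors)) * scale.

    Assumption: log2 is taken as an integer (floor) value.
    """
    sectors = size_bytes // SECTOR_SIZE
    if sectors < 1:
        return 0
    return (sectors.bit_length() - 1) * scale


size = 120 * 10**9  # nominal 120G drive

print(sector_threshold(size, scale=1))  # 27, the old limit
print(sector_threshold(size))           # 27648, the limit with the 1024 factor

misread = 983137  # what g-d-u shows for this disk
actual = 15       # what smartctl reports
print(misread > sector_threshold(size))  # True: still flagged as failing
print(actual > sector_threshold(size))   # False: the real count is fine
```

Under these assumptions the larger limit does not help this disk, because the misread value is still far above it; the parsing of the raw value is the actual problem.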

Comment 117 Chris Halsall 2012-01-30 21:39:35 UTC

@Alex...

What does WinBlows do?

It probably won't be correct, but it is what the Users expect....

Comment 118 Alex Butcher 2012-01-30 22:06:01 UTC

@Chris 

I don't think I've ever seen Windows alert on SMART values, but I've only used up to XP with any regularity. Third-party tools generally compare the cooked values with the thresholds, as libatasmart does for most other attributes.

Comment 119 Adam Williamson 2012-01-30 23:55:18 UTC

if this bug exists in upstream libatasmart, it may be worth reporting directly to upstream. Lennart, are you ever planning to comment on this bug?




Comment 120 Adam Williamson 2012-01-30 23:55:40 UTC

chris: AFAIK Windows does not have any built-in SMART failure monitoring.

Comment 121 Stuart D Gathman 2012-02-15 03:21:49 UTC

Added my 2 cents at https://bugs.freedesktop.org/show_bug.cgi?id=25772#c12

Comment 122 xuesong 2012-03-13 18:03:55 UTC

I have the same problem too. Threshold 36, value 156 sectors, assessment Warning. The result is FAILED (Read). I almost replaced my disk.

--------------------------------------------------------------------

Hoho, I come from China. This is the first bug I have viewed here, hoho.

Comment 123 Stuart D Gathman 2012-06-04 14:24:34 UTC

You need to be more explicit.  Perhaps paste the bad sectors lines from the output of smartctl -a /dev/sda, or show us all four numbers from GDU.  If the "normalized" number is 156, then you are *probably* ok until it falls to 36.  But we want to see the "raw" value (last one shown in GDU) to know if it is relevant to this bug (where GDU looks at raw value).

(In reply to comment #122)
> I have the same problem too. Threshold 36, value 156 sectors, assessment
> Warning. The result is FAILED (Read). I almost replaced my disk.
> 
> --------------------------------------------------------------------
> 
> Hoho,I come from china.This is my first view bug here,hoho
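
The normalized and raw columns mentioned in the reply above can be pulled out of a `smartctl -A` row with a sketch like the one below. It is only an illustration: the names `SmartRow` and `parse_row` are hypothetical, and the column layout is assumed to be the ten-column table that smartctl prints (ID#, name, flag, value, worst, threshold, type, updated, when failed, raw value). The sample row is the one quoted in comment 111.

```python
from typing import NamedTuple


class SmartRow(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized ("cooked") value
    worst: int
    thresh: int  # manufacturer threshold
    raw: str     # raw value, kept as text because some rows carry extra fields


def parse_row(line: str) -> SmartRow:
    """Parse one attribute row of `smartctl -A` output."""
    # Split into at most 10 fields so the raw column keeps any embedded spaces.
    parts = line.split(None, 9)
    return SmartRow(
        attr_id=int(parts[0]),
        name=parts[1],
        value=int(parts[3]),
        worst=int(parts[4]),
        thresh=int(parts[5]),
        raw=parts[9].strip(),
    )


SAMPLE = "196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       15"

row = parse_row(SAMPLE)
print(row.value, row.thresh, row.raw)  # 100 0 15
```

Posting the four numbers this way (value, worst, threshold, raw) makes it clear whether a report concerns the normalized value or the raw value that GDU looks at.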

Comment 124 xuesong 2012-07-27 03:34:57 UTC

(In reply to comment #123)
> You need to be more explicit.  Perhaps paste the bad sectors lines from the
> output of smartctl -a /dev/sda, or show us all four numbers from GDU.  If
> the "normalized" number is 156, then you are *probably* ok until it falls to
> 36.  But we want to see the "raw" value (last one shown in GDU) to know if
> it is relevant to this bug (where GDU looks at raw value).
> 
> (In reply to comment #122)
> > I have the same problem too. Threshold 36, value 156 sectors, assessment
> > Warning. The result is FAILED (Read). I almost replaced my disk.
> > 
> > --------------------------------------------------------------------
> > 
> > Hoho,I come from china.This is my first view bug here,hoho

I am not using GNOME 3 now; I use KDE instead. I paste the command-line output as follows:
smartctl 5.42 2011-10-20 r3458 [i686-linux-3.1.0-1.2-desktop] (SUSE RPM)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 5400.4
Device Model:     ST9250827AS
Serial Number:    5RG7W31S
LU WWN Device Id: 5 000c50 0142994f5
Firmware Version: 3.AAC
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Jul 27 11:26:39 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
                                                                                        
=== START OF READ SMART DATA SECTION ===                                                
SMART overall-health self-assessment test result: PASSED                                
See vendor-specific Attribute list for marginal Attributes.                             
                                                                                        
General SMART Values:                                                                   
Offline data collection status:  (0x82) Offline data collection activity                
                                        was completed without error.                    
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 113) The previous self-test completed having                                                                                                     
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (  426) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  92) minutes.
SCT capabilities:              (0x0001) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   096   096   020    Old_age   Always       -       4670
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       181
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       4835192919
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       11380
 10 Spin_Retry_Count        0x0013   100   100   034    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   096   096   020    Old_age   Always       -       4650
187 Reported_Uncorrect      0x0032   012   012   000    Old_age   Always       -       88
189 High_Fly_Writes         0x003a   090   090   000    Old_age   Always       -       10
190 Airflow_Temperature_Cel 0x0022   052   030   045    Old_age   Always   In_the_past 48 (2 146 70 8 0)
191 G-Sense_Error_Rate      0x0032   095   095   000    Old_age   Always       -       11974
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1226
193 Load_Cycle_Count        0x0022   036   036   000    Old_age   Always       -       128486
194 Temperature_Celsius     0x001a   048   070   000    Old_age   Always       -       48 (0 8 0 0 0)
195 Hardware_ECC_Recovered  0x0012   057   045   000    Old_age   Always       -       77810043
197 Current_Pending_Sector  0x0010   100   100   000    Old_age   Offline      -       5
198 Offline_Uncorrectable   0x003e   100   100   000    Old_age   Always       -       5
199 UDMA_CRC_Error_Count    0x0000   200   200   000    Old_age   Offline      -       1
200 Multi_Zone_Error_Rate   0x0032   100   253   000    Old_age   Always       -       0
202 Data_Address_Mark_Errs  0x0000   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 90 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 90 occurred at disk power-on lifetime: 11275 hours (469 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  2f 00 01 10 00 00 a0 00      00:01:54.785  READ LOG EXT
  60 18 18 00 00 00 00 00      00:01:52.381  READ FPDMA QUEUED
  60 a8 a8 00 00 00 00 00      00:01:52.369  READ FPDMA QUEUED
  60 08 08 00 00 00 00 00      00:01:52.344  READ FPDMA QUEUED
  60 08 08 00 00 00 00 00      00:01:52.344  READ FPDMA QUEUED

Error 89 occurred at disk power-on lifetime: 8853 hours (368 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 18 94 35 e0  Error: UNC at LBA = 0x00359418 = 3511320

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  29 ff 7f 05 94 35 e0 00      00:00:46.385  READ MULTIPLE EXT
  25 ff 7f 05 94 35 e0 00      00:00:46.384  READ DMA EXT
  29 ff 7f 86 93 35 e0 00      00:00:46.372  READ MULTIPLE EXT
  25 ff 7f 86 93 35 e0 00      00:00:46.371  READ DMA EXT
  25 ff 7f 07 93 35 e0 00      00:00:46.370  READ DMA EXT

Error 88 occurred at disk power-on lifetime: 8853 hours (368 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 18 94 35 e0  Error: UNC at LBA = 0x00359418 = 3511320

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff 7f 05 94 35 e0 00      00:00:46.385  READ DMA EXT
  29 ff 7f 86 93 35 e0 00      00:00:46.384  READ MULTIPLE EXT
  25 ff 7f 86 93 35 e0 00      00:00:46.372  READ DMA EXT
  25 ff 7f 07 93 35 e0 00      00:00:46.371  READ DMA EXT
  25 ff 7f 88 92 35 e0 00      00:00:46.370  READ DMA EXT

Error 87 occurred at disk power-on lifetime: 8853 hours (368 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b0 93 35 e0  Error: UNC at LBA = 0x003593b0 = 3511216

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 ff 7f 86 93 35 e0 00      00:00:46.385  READ DMA EXT
  25 ff 7f 07 93 35 e0 00      00:00:46.384  READ DMA EXT
  25 ff 7f 88 92 35 e0 00      00:00:46.372  READ DMA EXT
  25 ff 7f 09 92 35 e0 00      00:00:46.371  READ DMA EXT
  25 ff 7f 8a 91 35 e0 00      00:00:46.370  READ DMA EXT

Error 86 occurred at disk power-on lifetime: 8734 hours (363 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 8e 8e 35 e0  Error: UNC at LBA = 0x00358e8e = 3509902

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  29 ff 10 80 8e 35 e0 00      00:00:31.810  READ MULTIPLE EXT
  25 ff 10 80 8e 35 e0 00      00:00:31.809  READ DMA EXT
  25 ff 7f 01 8e 35 e0 00      00:00:31.808  READ DMA EXT
  25 ff 7f 82 8d 35 e0 00      00:00:31.807  READ DMA EXT
  25 ff 7f 03 8d 35 e0 00      00:00:31.805  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       10%      9791         488018500
# 2  Short offline       Completed: read failure       10%      9678         488027838

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Maybe this can help you.
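
For anyone cross-checking the log above: the "UNC at LBA" values can be recomputed from the SN, CL, CH and DH columns of the register dump. The sketch below assumes plain 28-bit LBA addressing, with the low four bits of DH holding LBA bits 27:24. The function names are made up for illustration and are not part of smartctl or libatasmart.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA task-file registers into a 28-bit LBA.

    Assumption: LBA bits 27:24 sit in the low nibble of DH, bits 23:16
    in CH, bits 15:8 in CL and bits 7:0 in SN.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_from_error_row(row):
    """Parse one 'ER ST SC SN CL CH DH' row as printed in the error log."""
    er, st, sc, sn, cl, ch, dh = (int(field, 16) for field in row.split()[:7])
    return lba28_from_registers(sn, cl, ch, dh)


# Register rows copied from errors 89, 87 and 86 in the log above.
rows = [
    "40 51 00 18 94 35 e0",
    "40 51 00 b0 93 35 e0",
    "40 51 00 8e 8e 35 e0",
]

for row in rows:
    lba = lba_from_error_row(row)
    print(f"{row}  ->  LBA = 0x{lba:08x} = {lba}")
```

The three results agree with what smartctl printed (3511320, 3511216 and 3509902), so all of these read errors are within a couple of thousand sectors of each other.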

Comment 125 Stuart D Gathman 2012-07-30 17:40:09 UTC

So now the reallocated sector count is 181.  While still well within manufacturer specs, I like to keep my counts closer to zero :-)  Since it has grown from 156 to 181 in just a few months, I would consider replacing the disk.  But the "FAIL" result is incorrect, and a bug.

This again underscores the wrongness of not using the manufacturer limits within SMART.  You can provide a *warning* via heuristic, but *don't* change the SMART status!  A more useful warning than "more reallocations than I've experienced" (since big disks have more reallocations) would be "reallocations increasing at an alarming rate!"
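
A rough sketch of what such a rate-based warning could look like, kept separate from the SMART status itself. This is only an illustration, not how libatasmart or gnome-disk-utility work: the function names, the 5-sectors-per-30-days threshold, and the earlier sample date are all assumptions (only the counts 156 and 181 and the later date come from this report).

```python
from datetime import date


def reallocation_rate(samples):
    """Reallocated sectors gained per 30 days between first and last sample.

    `samples` is a list of (date, reallocated_sector_count) tuples,
    ordered oldest first.
    """
    (first_day, first_count), (last_day, last_count) = samples[0], samples[-1]
    days = (last_day - first_day).days
    if days <= 0:
        raise ValueError("samples must span at least one day")
    return (last_count - first_count) * 30.0 / days


def assess(smart_passed, samples, max_per_30_days=5.0):
    """Report the drive's own SMART verdict unchanged, plus heuristic warnings.

    max_per_30_days is a hypothetical threshold chosen for this example.
    """
    warnings = []
    if reallocation_rate(samples) > max_per_30_days:
        warnings.append("reallocations increasing at an alarming rate")
    return {
        "smart_status": "PASSED" if smart_passed else "FAILED",
        "warnings": warnings,
    }


# 156 -> 181 are the counts from this bug; the first date is a guess
# standing in for "a few months" before 2012-07-30.
samples = [(date(2012, 4, 30), 156), (date(2012, 7, 30), 181)]
print(assess(smart_passed=True, samples=samples))
```

The point of the design is that the heuristic only ever adds a warning. The status stays whatever the drive reports against the manufacturer thresholds.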

Comment 126 Fedora End Of Life 2013-01-16 22:06:35 UTC

This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 127 Fedora End Of Life 2013-02-14 00:45:02 UTC

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.