Bug 503344

Summary: smart tests never complete - Seagate 7200.12 ST31000528AS CC34
Product: [Fedora] Fedora
Reporter: Wolfgang Rupprecht <wolfgang.rupprecht>
Component: smartmontools
Assignee: Michal Hlavinka <mhlavink>
Status: CLOSED CANTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 11
CC: bmason, fedoraproservices, hugh, jst, mhlavink, redhat_bugzilla
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-08-07 17:58:05 UTC
Attachments:
Description                                                                Flags
smartctl -a /dev/sda output after a long test was run for over 50 minutes  none
Output of smartctl /dev/sdb -a                                             none
Output of smartctl /dev/sda -a                                             none
output from smartctl -a /dev/sda                                           none

Description Wolfgang Rupprecht 2009-05-31 14:26:07 UTC
Created attachment 346009 [details]
smartctl -a /dev/sda output after a long test was run for over 50 minutes

Description of problem:

A long (full) test run with "smartctl -t long /dev/sda" on a Seagate 7200.12 ST31000528AS CC34 hangs showing 90% of the test remaining, even after 24 hours.  The hang is at least somewhat drive-specific since 3 older Seagate drives complete their tests in the expected 1-5 hours.

Version-Release number of selected component (if applicable):
libatasmart.x86_64                    0.12-3.fc11
smartmontools.x86_64                  1:5.38-11.fc11
kernel.x86_64                         2.6.29.4-167.fc11


How reproducible:
always

Steps to Reproduce:
1. smartctl -t long /dev/sda
2. wait a few hours
3. smartctl -a /dev/sda
   notice that the test still says "90% remaining"
  
Actual results:
90% remaining shows after 24 hours.

Expected results:
0% remaining after <4 hours

Additional info:

similar reports:

freebsd: 
   http://www.mail-archive.com/freebsd-hackers@freebsd.org/msg67741.html
debian:
   http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=503439

The Debian report indicates it may be related to AHCI and that the tests run normally under the standalone, FreeDOS-based SeaTools SMART tests.

Note that the attached file shows a long test 50 minutes into an estimated 200-minute run.  It should be displaying at most 80% remaining.
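
The "at most 80%" figure can be checked with a little arithmetic. The sketch below assumes the drive reports the remaining percentage in 10% steps (as the self-test logs in this report suggest) and rounds up to the next step; the function name is hypothetical and not part of smartmontools.

```python
def expected_remaining_percent(elapsed_minutes: int, total_minutes: int) -> int:
    """Remaining percentage a healthy test should report, rounded up to a 10% step."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    # Clamp at zero so an overdue test is expected to show 0% remaining.
    remaining = max(total_minutes - elapsed_minutes, 0)
    # Ceiling division into tenths of the total, then back to percent.
    tenths = -(-remaining * 10 // total_minutes)
    return tenths * 10


# 50 minutes into a 200-minute test: 75% is left, reported as 80%.
print(expected_remaining_percent(50, 200))
```

A drive still reporting 90% at that point is already behind schedule, and after 24 hours the expected value is 0%.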

Comment 1 Michal Hlavinka 2009-06-02 16:08:23 UTC
hi, 

Do you have any tests scheduled in smartd.conf or elsewhere?

Does the short test work for you?

Is this disk under heavy load or mostly idle? Can you try testing this disk while keeping it idle? (Use smartctl -X /dev/sda before the test.)

Comment 2 Wolfgang Rupprecht 2009-06-02 16:34:10 UTC
I have observed this with both scheduled tests and tests started from the command line.  I do have scheduled tests in /etc/smartd.conf.  I first noticed the problem when after 24 hrs the next scheduled test was aborted because the previous test didn't complete.

The disk is under very light load, but I can't make it totally idle since it is the / filesystem disk (i.e. the rootfs).  The disk light on the computer case rarely flashes.

Short tests at first also hung, but they seem to be finishing now.  I don't know what changed.

After 48 hours, this is what it now shows (the lifetime hours have incremented, but we are still at 90% remaining):

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%       329         -
# 2  Conveyance offline  Completed without error       00%       278         -
# 3  Short offline       Completed without error       00%       278         -
# 4  Conveyance offline  Aborted by host               90%       277         -
# 5  Short offline       Aborted by host               90%       273         -
# 6  Extended offline    Aborted by host               90%       273         -
# 7  Extended offline    Interrupted (host reset)      90%       252         -
# 8  Extended offline    Interrupted (host reset)      90%       219         -
# 9  Extended offline    Interrupted (host reset)      90%       164         -
#10  Extended offline    Interrupted (host reset)      90%        48         -

I've aborted the tests that have run for 48 hrs with smartctl -X /dev/sda and started a new test as requested.
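
For anyone who wants to watch for this condition automatically, here is a rough sketch that parses a self-test log like the one above and lists the entries that never completed. The regular expression is an assumption based only on the column layout shown in this report, and the class and function names are hypothetical; other smartctl versions may format the log differently.

```python
import re
from typing import List, NamedTuple


class SelfTestEntry(NamedTuple):
    num: int
    description: str
    status: str
    remaining: int
    lifetime_hours: int
    lba: str


# Assumed layout: "#NN  <description>  <status> <NN>%  <hours>  <lba>"
_ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftest_log(text: str) -> List[SelfTestEntry]:
    """Parse the rows of a 'smartctl -l selftest' style table; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = _ROW.match(line)
        if m:
            num, desc, status, rem, hours, lba = m.groups()
            entries.append(SelfTestEntry(int(num), desc, status, int(rem), int(hours), lba))
    return entries


def unfinished(entries: List[SelfTestEntry]) -> List[SelfTestEntry]:
    """Entries that are in progress, aborted, or interrupted."""
    return [e for e in entries if not e.status.startswith("Completed")]


SAMPLE = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%       329         -
# 2  Conveyance offline  Completed without error       00%       278         -
# 3  Short offline       Completed without error       00%       278         -
# 4  Conveyance offline  Aborted by host               90%       277         -
# 5  Short offline       Aborted by host               90%       273         -
# 6  Extended offline    Aborted by host               90%       273         -
# 7  Extended offline    Interrupted (host reset)      90%       252         -
# 8  Extended offline    Interrupted (host reset)      90%       219         -
# 9  Extended offline    Interrupted (host reset)      90%       164         -
#10  Extended offline    Interrupted (host reset)      90%        48         -
"""

for entry in unfinished(parse_selftest_log(SAMPLE)):
    print(entry.num, entry.description, entry.status, f"{entry.remaining}%")
```

Every unfinished entry in this log stopped at 90% remaining, which is the pattern the rest of this report keeps running into.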

Comment 3 Michal Hlavinka 2009-06-03 14:28:28 UTC
I don't know what you are using this computer for, but is it possible to run the SMART test while the system is (almost) idle? If the system is idle, it doesn't matter whether the root partition is on this disk.

Comment 4 Bug Zapper 2009-06-09 16:53:10 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 5 D. Hugh Redelmeier 2009-06-27 13:57:10 UTC
See also Wolfgang Rupprecht's thread on the Seagate Forum: http://forums.seagate.com/stx/board/message?board.id=ata_drives&thread.id=13219&view=by_date_ascending&page=1

Another user is experiencing this in Gentoo.

Rupprecht: we have to stop meeting like this (in bugzilla).

Comment 6 Michal Hlavinka 2009-06-30 12:34:50 UTC
So, to be clear:
> The debian reports indicates it may be related to ahci and that
> the tests run normally under the standalone seatools freedos-based
> smart tests.

Did you personally try SeaTools, or smartctl with AHCI disabled?


Have you tried this disk in a different machine, or a different disk in this machine?

Comment 7 D. Hugh Redelmeier 2009-06-30 14:16:13 UTC
> did you personally tried seatools or smartctl with disabled ahci?

No.  I'm not experiencing this problem: I don't even have one of these drives (Seagate 7200.12 series).  I'm just trying to help figure out what is going on.

Comment 8 Michal Hlavinka 2009-07-01 08:34:14 UTC
(In reply to comment #7)
> > did you personally tried seatools or smartctl with disabled ahci?
> 
> No.  I'm not experiencing this problem: I don't even have one of these drives
> (Seagate 7200.12 series).  I'm just trying to help figure out what is going on.  

OK, thanks, but this question was directed at Wolfgang :)

Comment 9 Wolfgang Rupprecht 2009-07-01 09:44:23 UTC
Just to re-iterate, the computer is my main machine, www, smtp and nfs server.  Taking it single-user for a 5 hour test is going to be painful.  The disk is formatted ext4 and the machine has 8 Gigs of memory, so things tend to be answered out of memory and disk writes only happen very infrequently.  I have to watch for a very long time to even see the disk light flash once.  For all practical purposes it looks like the disk is 99 percent idle.  I don't think it is normal disk IO activity that is interfering with the test.

Comment 10 Michal Hlavinka 2009-07-01 12:19:11 UTC
Can you answer my questions from comment #6? Thanks.

Comment 11 Wolfgang Rupprecht 2009-07-01 17:02:09 UTC
I thought I did answer it.  No.  I did not and don't want to take the computer off-line to run seatools for 5 hrs.  And no, that includes taking the disk out of the computer.

Comment 12 Michal Hlavinka 2009-07-02 11:49:24 UTC
(In reply to comment #11)
> I thought I did answer it.  No.  I did not and don't want to take the computer
> off-line to run seatools for 5 hrs.  And no, that includes taking the disk out
> of the computer.  

Are you able to test another disk in this machine?

Comment 13 Bryan Mason 2009-07-06 04:55:53 UTC
Created attachment 350568 [details]
Output of smartctl /dev/sdb -a

I appear to be hitting this bug as well.  I started an extended self-test using palimpsest, and after several hours, the test had not completed.  Running "smartctl /dev/sdb -l selftest" showed 90% of the test remaining, even after several hours of running.

Performing the long self-test using SeaTools completed in about 1h15m.

Device Model:     ST3500410AS
Firmware Version: CC34

This is running on a Dell Inspiron 530.  This is one of two drives that form part of a RAID-1 mirror (if that makes a difference).

Let me know if there's any other testing I can do.

Comment 14 Wolfgang Rupprecht 2009-07-06 05:18:44 UTC
(In reply to comment #12)
> (In reply to comment #11)
> > I thought I did answer it.  No.  I did not and don't want to take the computer
> > off-line to run seatools for 5 hrs.  And no, that includes taking the disk out
> > of the computer.  
> 
> Are you able to test another disk in this machine?  

Yes, all 3 other drives (older seagates) complete the tests as expected (within the test duration estimates the drives give.)  Here are the model numbers/firmware version:
ST31500341AS SD3B
ST3750640AS  3.AAE
ST3250824AS  3.AAD

Comment 15 Bryan Mason 2009-07-06 06:13:30 UTC
Created attachment 350572 [details]
Output of smartctl /dev/sda -a

Interesting...looks like a long test on an almost identical drive (7200.12 but different Model No. and different Firmware Rev) is about to complete (only 10% remaining).

Device Model:     ST3500418AS
Firmware Version: CC44

Comment 16 Michal Hlavinka 2009-07-08 14:30:24 UTC
(In reply to comment #13)
> Created an attachment (id=350568) [details]
> Output of smartctl /dev/sdb -a
> 
> I appear to be hitting this bug as well. I started an extended self-test using
> palimpsest, and after several hours, the test had not complete.

That's another point suggesting this is not smartmontools' fault. Palimpsest does not depend on smartmontools.

> Running
> "smartctl /dev/sdb -l selftest" showed 90% of the test remaining, even after
> several hours running.

> Performing the long self-test using SeaTools completed in about 1h15m.

I guess it's firmware related and SeaTools just contains some workaround for this...

(In reply to comment #14)
> > Are you able to test another disk in this machine?  
> 
> Yes, all 3 other drives (older seagates) complete the tests as expected (within
> the test duration estimates the drives give.)  Here are the model
> numbers/firmware version:
> ST31500341AS SD3B
> ST3750640AS  3.AAE
> ST3250824AS  3.AAD  

And another thing pointing to firmware: since the other disks are working, it seems it's not motherboard/driver related.

SMART commands are quite simple; there is not much room for doing something wrong if other disks/models/firmware versions are working.

(In reply to comment #15)
> Interesting...looks like a long test on an almost identical drive (7200.12 but
> different Model No. and different Firmware Rev) is about to complete (only 10%
> remaining).

and another one :)

Well... I'll try to ask upstream for some information, but I presume there will be no progress.

If anyone here with a non-working disk is willing to test a smartmontools CVS snapshot, please let me know.

Comment 17 Bryan Mason 2009-07-09 16:58:07 UTC
(In reply to comment #16)
> (In reply to comment #13)
> > Created an attachment (id=350568) [details] [details]
> > Output of smartctl /dev/sdb -a
> > 
> > I appear to be hitting this bug as well. I started an extended
> > self-test using palimpsest, and after several hours, the test had
> > not complete.
> 
> thats another point for not smartmontools fault. Palimpsest does not
> depend on smartmontools.

True.  However, I've also reproduced the problem using just "smartctl /dev/sdb --test=long".

> (In reply to comment #15)
> > Interesting...looks like a long test on an almost identical drive
> > (7200.12 but different Model No. and different Firmware Rev) is
> > about to complete (only 10% remaining).
> 
> and another one :)

I think this is the most compelling evidence that this is a firmware issue.  I've been searching for ways to upgrade the firmware on the problematic drive, but have been unsuccessful. 

> If anyone here with not working disks is willing to test smartmontools cvs
> snapshot, please let me know.

I can do that.

Comment 18 Michal Hlavinka 2009-07-10 09:31:03 UTC
(In reply to comment #17) 
> > If anyone here with not working disks is willing to test smartmontools cvs
> > snapshot, please let me know.
> 
> I can do that.  

you can find new packages here:

http://koji.fedoraproject.org/koji/taskinfo?taskID=1465244

Comment 19 Bryan Mason 2009-07-10 15:50:44 UTC
Thanks!  Testing now...

Comment 20 Bryan Mason 2009-07-10 22:48:06 UTC
No joy.  After several hours it is still stuck at 90%.

Comment 21 Michal Hlavinka 2009-07-13 14:39:48 UTC
Pity, but it was expected. There is a related problem reported on the smartmontools mailing list (different Seagate disk model). The answer was:

> This is probably a firmware bug.
>
> If this disk supports selective self-tests, please try if this type of 
> test also hangs.
>
> This tests the first 25GB:
>
> #  smartctl -t select,0-49999999 /dev/ice
>
>
> If that finishes, you can test the next 25GB with:
>
> #  smartctl -t select,50000000-99999999 /dev/ice
> 
> or:
>
> #  smartctl -t select,next /dev/ice

The original reporter did not reply to this; could you test whether this works?
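
For reference, the spans in the quoted advice are 50,000,000 sectors each, which is about 25.6 GB if the drive uses 512-byte sectors (an assumption). A small sketch that generates the full sequence of such commands for a whole disk follows. The function names, the /dev/sdX device, and the total sector count are placeholders; the real sector count would have to be read from the drive, for example with `blockdev --getsz`.

```python
from typing import Iterator, List, Tuple


def selective_spans(total_sectors: int, span_sectors: int = 50_000_000) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first_lba, last_lba) spans that cover LBA 0..total_sectors-1."""
    if span_sectors <= 0:
        raise ValueError("span_sectors must be positive")
    start = 0
    while start < total_sectors:
        # The last span is shortened so it ends on the final LBA.
        end = min(start + span_sectors, total_sectors) - 1
        yield start, end
        start = end + 1


def selective_commands(device: str, total_sectors: int, span_sectors: int = 50_000_000) -> List[str]:
    """Build one 'smartctl -t select,N-M' command line per span (nothing is executed)."""
    return [
        f"smartctl -t select,{first}-{last} {device}"
        for first, last in selective_spans(total_sectors, span_sectors)
    ]


# Hypothetical 120,000,000-sector disk, just to show the shape of the output.
for command in selective_commands("/dev/sdX", 120_000_000):
    print(command)
```

Each command should only be issued after the previous selective test has finished or been aborted with smartctl -X.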

Comment 22 Bryan Mason 2009-07-14 00:03:00 UTC
(In reply to comment #21)
> original reporter does not replied to this, could you test if this works?  

I tested, but it didn't work:

# smartctl -t select,0-49999999 /dev/sdb
smartctl 5.39 2009-07-07 19:28 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Selective self-test routine immediately in off-line mode".
SPAN         STARTING_LBA           ENDING_LBA
   0                    0             49999999
Drive command "Execute SMART Selective self-test routine immediately in off-line mode" successful.
Testing has begun.

[wait several hours]

# smartctl -l selftest /dev/sdb
smartctl 5.39 2009-07-07 19:28 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA
# 1  Selective offline   Self-test routine in progress 90%       352         -

Comment 23 Michal Hlavinka 2009-07-14 13:58:45 UTC
This was somewhat expected; S.M.A.R.T. commands are quite simple, so there is not much room for doing something the wrong way. Thanks for testing anyway. I'll reply to the old email and see if upstream can advise something.

Comment 24 Michal Hlavinka 2009-08-04 09:34:50 UTC
Unfortunately, no good news. Nothing useful came from asking upstream or from searching for any information related to firmware updates.

Comment 27 Michal Hlavinka 2009-08-07 17:58:05 UTC
I wasn't able to find any info about a firmware upgrade on Seagate's website.

This is a firmware issue; can't fix.

Comment 28 Jean-Sébastien Trottier 2009-09-22 17:49:56 UTC
I'm having the same issue with a Seagate Barracuda 7200.9 (ST3250624AS 3.AAE), where the long test gets stuck at 90% remaining.

In comment #14, Wolfgang Rupprecht stated he had this exact same model and firmware version and the long test completed fine for him.

Maybe firmware is not the issue after all.

Seagate's site says my drive (based on serial number) does not require a firmware update.

Is there anything you want me to try?

Comment 29 Michal Hlavinka 2009-10-05 15:19:49 UTC
> Is there anything you want me to try?

Unfortunately, there is nothing new that could help with this problem.

Comment 30 Joseph Pingenot 2011-02-16 18:36:37 UTC
My brother has seen this with a Hitachi Deskstar drive.  Attaching SMART info.

Dunno if reproducing it on another drive helps anyone, but here it is.  

The drive lost some of Windows' important files (this was generated from an Ubuntu CD).  The system log isn't more informative.

Comment 31 Joseph Pingenot 2011-02-16 18:37:07 UTC
Created attachment 479190 [details]
output from smartctl -a /dev/sda

Comment 32 J. Rothschild 2011-02-20 00:22:46 UTC
I believe I have found that this is not a bug in smartmontools and appears to be a drive anomaly that can come and go on its own (without applying updates, cold/hard rebooting, etc). 

I was experiencing the same problem as Wolfgang Rupprecht and found this bug entry via a Google search. It's certainly a rare but somewhat common problem on many Seagate drives... Out of all the drives I monitored over the years, this is the only one I've ever had with this issue. Suddenly a few weeks ago I noticed in the logwatch that the nightly tests were not completing from the previous night on my Seagate Maxtor DiamondMax 21 Model STM3320620AS. I manually ran the short, conveyance and extended tests one at a time, numerous times, and sure enough they did not complete even if I waited for several days. Last night I then ran selective offline tests, incrementing the range of LBAs to be tested on each run, to see where it hung, as this would help identify the LBA region with the problem. The selective tests stopped somewhere between LBA 4999-49999 with the 90% remaining issue. Today, after trying a couple dozen short tests (all of which hung at 90% remaining), I started the selective tests that I had tried last night over again in smaller increments to try to isolate the LBA region with the problem, and all tests completed successfully, instead of the "90% remaining" that had been the result for all attempted tests. I then tried a short test and it completed in the usual 1-minute. Interesting! I tried the conveyance, it completed in 6 minutes, and then I tried the extended "long" test (it took about 2 hrs) -- all completed successfully! But I changed nothing. So for three weeks all tests got stuck at 90% remaining, yet suddenly after the selective offline tests, all tests subsequently succeeded. I am not sure if the selective offline LBA range tests helped to resolve whatever the issue was or not (some sort of smart "wake up call" for the hard drive?), interesting coincidence. My gut feeling though, is that this is a symptom that is related to the individual drive (some type of drive anomaly) and likely has nothing to do with smartmontools. 
The drive looks to be very healthy per smart output and test results, but if this "getting stuck at 90% remaining" issue comes back, I'll probably just replace the drive for peace of mind. Hopefully this additional behavioural information is of some use to others wondering what's going on and how to proceed. I recommend a selective offline test using ranges to isolate the problem.. if all succeed, great, move on to short test, and so on. If all is well and the issue never happens again, you may want to keep the drive. If you cannot get the tests to complete afterwards or if the issue resurfaces, drive replacement should be considered. In all cases, backup your data. I'll continue to monitor this one drive and will report back with any other developments.
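
To narrow down a suspect region as described above, the range can be cut into smaller, contiguous pieces and each piece tested in turn. This is only a sketch of the bookkeeping: the helper name and the choice of four pieces are hypothetical, and the LBA range is the one mentioned in this comment.

```python
from typing import List, Tuple


def split_range(first_lba: int, last_lba: int, parts: int) -> List[Tuple[int, int]]:
    """Split an inclusive LBA range into up to `parts` contiguous, near-equal spans."""
    if parts < 1 or last_lba < first_lba:
        raise ValueError("need parts >= 1 and last_lba >= first_lba")
    total = last_lba - first_lba + 1
    parts = min(parts, total)  # never produce empty spans
    base, extra = divmod(total, parts)
    spans = []
    start = first_lba
    for i in range(parts):
        # The first `extra` spans take one additional sector each.
        size = base + (1 if i < extra else 0)
        spans.append((start, start + size - 1))
        start += size
    return spans


# The region where the selective tests hung, cut into four pieces.
for first, last in split_range(4999, 49999, 4):
    print(f"smartctl -t select,{first}-{last} /dev/sdX")
```

If one piece hangs while the others complete, the same helper can be applied to that piece again to narrow the region further.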

Comment 33 J. Rothschild 2011-02-20 00:30:15 UTC
Based on the update from Joseph Pingenot a few days ago on 2/16/2011, his brother had this issue and lost some data. On that note, this might just be how some Seagate drives flake out when some component is failing, so I am probably going to view this as a "pre-failure" symptom and replace the drive in the near future. Yes, my data is backed up, and if you're having this issue, you will want to do the same for yourself.

In case you're wanting to try the selective offline mode to test your drive in LBA ranges, here is an example (man smartctl for more info): 
smartctl -t select,0-50000 /dev/sd<X>