| Summary: | SATA WD Drive changes device name/goes off line | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | A. Mani <a.mani.cms> | ||||
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 14 | CC: | gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, rtguille | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2011-04-03 14:43:12 UTC | Type: | --- | ||||
| Attachments: |
|
||||||
smartctl -a /dev/sdb

smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    <snip>
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Mar 9 07:25:48 2011 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (20880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 240) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   230   132   021    Pre-fail  Always       -       1475
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       301
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       120
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       296
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       263
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       874
194 Temperature_Celsius     0x0022   105   104   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       7
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       4

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
From:
http://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes
http://www.ariolic.com/activesmart/smart-attributes/

197 - 0xC5 - Current Pending Sector Count:
* Count of "unstable" sectors (waiting to be remapped, because of read errors). If an unstable sector is subsequently written or read successfully, this value is decreased and the sector is not remapped. Read errors on a sector will not remap the sector (since it might be readable later); instead, the drive firmware remembers that the sector needs to be remapped, and remaps it the next time it's written.
* Hard drives that support this attribute: Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, WD (Western Digital)

198 - 0xC6 - Uncorrectable Sector Count (reported by smartctl as Offline_Uncorrectable):
* The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem. (or Off-Line Scan Uncorrectable Sector Count: Fujitsu)
* Hard drives that support this attribute: Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, WD (Western Digital)

200 - 0xC8 - Write Error Rate (Fujitsu):
* The total count of errors when writing a sector.
* Write data errors rate. This attribute indicates the total number of errors found when writing a sector. The higher the raw value, the worse the disk surface condition and/or mechanical subsystem is.
* Hard drives that support this attribute: Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, WD (Western Digital)

Attribute 197: you have 8
Attribute 198: you have 7
Attribute 200: you have 4

To me, it seems the drive is bad, so be prepared. MOVE the data off the drive (as a precaution). Monitor it: check whether these values increase, and by how much.

Or you can try to overwrite the whole drive several times (with dd) to force it to finally remap the bad sectors...

Another alternative is to use the WD tools to test it.
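The "monitor it" advice above can be sketched as a small script that compares the raw values in two saved `smartctl -a` outputs. This is only an illustration: the function names (`parse_raw_values`, `raw_value_changes`) and the default watch list are made up for this example, and the sample rows are copied from the attribute tables posted in this report.

```python
import re

# One row of the "Vendor Specific SMART Attributes with Thresholds" table:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)


def parse_raw_values(smartctl_text):
    """Return {attribute_id: (name, raw_value)} for every attribute row found."""
    values = {}
    for line in smartctl_text.splitlines():
        match = ROW.match(line)
        if match:
            attr_id, name, raw = match.groups()
            values[int(attr_id)] = (name, int(raw))
    return values


def raw_value_changes(before, after, watch=(5, 197, 198, 200)):
    """Return {id: (name, old_raw, new_raw)} for watched attributes that changed.

    The watch list is an assumption for this sketch, not an official set.
    """
    old = parse_raw_values(before)
    new = parse_raw_values(after)
    changes = {}
    for attr_id in watch:
        if attr_id in old and attr_id in new and old[attr_id][1] != new[attr_id][1]:
            changes[attr_id] = (new[attr_id][0], old[attr_id][1], new[attr_id][1])
    return changes


# Rows copied from the first and second smartctl outputs in this report.
MAR_09 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       7
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       4
"""

MAR_14 = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    for attr_id, (name, old_raw, new_raw) in sorted(
        raw_value_changes(MAR_09, MAR_14).items()
    ):
        print(f"{attr_id:>3} {name}: {old_raw} -> {new_raw}")
```

In practice the two inputs would be full `smartctl -a` outputs saved to files on different days; rows that do not look like attribute rows are simply skipped.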
Does a current smartctl -a show an increase in the SMART attributes?
Did you get I/O errors in messages?
Did the drive go offline at any time?

After I put in a jumper at 3-4 (undocumented), the drive has become OK.

7531
8642

The drive cleared the WD diagnostics. The whole problem was apparently due to the aggressive power-saving feature (default jumper-less state) of the drive.

_______________________Current Smartctl_________________

smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    <snip>
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Mar 14 04:02:03 2011 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (20880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 240) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       18
  3 Spin_Up_Time            0x0027   129   128   021    Pre-fail  Always       -       6541
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       346
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       159
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       338
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       297
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1165
194 Temperature_Celsius     0x0022   108   101   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       9
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
Warning: ATA error count 21 inconsistent with error log pointer 3

ATA Error Count: 21 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       124         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

________________________________

>> After I put in a jumper at 3-4 (undocumented), the drive has become OK.

From WD & WD "Jumper Settings Info Sheet":
http://www.wdc.com/en/library/eide/2579-001037.pdf

WD jumpers for SATA 3.0Gb/s (3.5'' HD)
none ---> both SSC (Spread Spectrum Clocking) and OPT1 are DISABLED
1-2 ---> SSC is ENABLED
5-6 ---> OPT1 is ENABLED (1.5Gb/s speed)
3-4 ---> UNDOCUMENTED
3-4 ---> PM2 (Power Management 2) ENABLE (only for some disks)
7-8 ---> UNDOCUMENTED (512 Byte HD Sector Disks)
7-8 ---> WinXP Single Partition (4K HW Sectors, Advanced Format Drives)

WD jumpers for SATA 6.0Gb/s (3.5'' HD)
none ---> both SSC (Spread Spectrum Clocking) and OPT1 are DISABLED
1-2 ---> SSC is ENABLED
5-6 ---> PHY is ENABLED (3.0Gb/s speed)
3-4 ---> UNDOCUMENTED
3-4 ---> PM2 (Power Management 2) ENABLE (only for some disks)
7-8 ---> UNDOCUMENTED (512 Byte HD Sector Disks)
7-8 ---> WinXP Single Partition (4K HW Sectors, Advanced Format Drives)

PM2 (Power Management 2) ENABLE (only for some disks)
* Power-up in standby? I remember my WD20EARS has that feature.
----

01 - 0x01 - Read Error Rate
* Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
* Hard drives that support this attribute: Samsung, Seagate, IBM (Hitachi), Fujitsu, Maxtor, WD (Western Digital)

Read Error Rate: previously you had 0, now you have 18?

03 - 0x03 - Spin-Up Time
* Average time of spindle spin up (from zero RPM to fully operational [millisecs]).

Previously: 1475, current: 6541
PM2 (Power Management 2), ok

Your previous smartctl says:

SMART Error Log Version: 1
No Errors Logged

Your current smartctl:

SMART Error Log Version: 1
Warning: ATA error count 21 inconsistent with error log pointer 3

ATA Error Count: 21 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%       124         -

from: http://sourceforge.net/apps/trac/smartmontools/wiki/FAQ#Warning:ATAerrorcount9inconsistentwitherrorlogpointer5Whatsthemeaningofthissmartctlmessage

'Warning: ATA error count 9 inconsistent with error log pointer 5'
What's the meaning of this smartctl message?
* The ATA error log is stored in a circular buffer, and the ATA specifications are unambiguous about how the entries should be ordered. This warning message means that the disk's firmware does not strictly obey the ATA specification regarding the ordering of the error log entries in the circular buffer. Smartmontools will correct for this oversight, so this warning message can be safely ignored by users. (On the other hand, firmware engineers: please read the ATA specs more closely then fix your code!).

-------

You still have Current_Pending_Sector = 9

>> The drive cleared the WD diagnostics.

There are several levels of testing. Did you perform a write test (or similar, I don't remember the option names)? There are some data-destructive tests. You need to write to those 9 sectors to force a remap.

>> The whole problem was apparently due to the aggressive power saving feature (default jumper-less state) of the drive

The green drives certainly perform some power saving, but this is transparent and should not cause any issue. I own 2 of them. I do not use the PM2 feature.

Yes, the drive had errors. The extended WD diagnostic test (from CD) repaired the drive:
Code: 0223
Time taken: 3 hrs 30 min

I will check with WD support. I will close the bug if the drive has to be replaced or if I do not see the problem without jumpers.

Thanks

New smartctl -a

smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD10EARS-00Y5B1
Serial Number:    <snip>
Firmware Version: 80.00A80
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Mar 15 02:30:17 2011 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (20880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 240) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       24
  3 Spin_Up_Time            0x0027   183   128   021    Pre-fail  Always       -       3825
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       350
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       167
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       339
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       297
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1207
194 Temperature_Celsius     0x0022   105   101   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       9
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       166         -
# 2  Conveyance offline  Completed without error       00%       162         -
# 3  Conveyance offline  Completed without error       00%       124         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Update: WD recommends RMA.

Latest SMART:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       24
  3 Spin_Up_Time            0x0027   183   128   021    Pre-fail  Always       -       3816
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       351
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       174
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       340
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       297
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1259
194 Temperature_Celsius     0x0022   104   101   000    Old_age   Always       -       43
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

Closing this, as the drive was not OK. |
Created attachment 483086
hdparm

Description of problem:
I added a new WD SATA drive to my system. It was OK for a few days, but then started having this problem. The SMART test was OK.

Version-Release number of selected component (if applicable):
Fedora 14 x86-64 (updated stable)

How reproducible:
BIOS settings: AHCI or RAID with AHCI, PCI bus mastering on/off
Drive has an msdos partition table with ext4, xfs and jfs partitions
Jumpers: none

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

From dmesg|tail

exception Emask 0x10 SAct 0x0 SErr 0x40c0000 action 0xe frozen
[ 187.475938] ata2: irq_stat 0x00000040, connection status changed
[ 187.475975] ata2: SError: { CommWake 10B8B DevExch }
[ 187.475991] ata2: limiting SATA link speed to 1.5 Gbps
[ 187.476000] ata2: hard resetting link
[ 193.821030] ata2: link is slow to respond, please be patient (ready=0)
[ 195.709212] ata2: SATA link down (SStatus 0 SControl 310)
[ 195.709239] ata2: EH complete
_________________________________________

From lspci -v

00:12.0 RAID bus controller: ATI Technologies Inc SB600 Non-Raid-5 SATA
        Subsystem: Micro-Star International Co., Ltd. Device 7328
        Flags: bus master, 66MHz, medium devsel, latency 192, IRQ 22
        I/O ports at b000 [size=8]
        I/O ports at a000 [size=4]
        I/O ports at 9000 [size=8]
        I/O ports at 8000 [size=4]
        I/O ports at 7000 [size=16]
        Memory at fe8ff800 (32-bit, non-prefetchable) [size=1K]
        Capabilities: [60] Power Management version 2
        Kernel driver in use: ahci
____________________________________________
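To see how often the link drops, the dmesg lines quoted above can be filtered with a short script. This is a rough sketch, not a libata tool: the function name `link_events` and the keyword list are assumptions picked to match the messages in this report, and other kernels may word them differently.

```python
import re

# Matches lines such as:
# "[ 187.475991] ata2: limiting SATA link speed to 1.5 Gbps"
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(ata\d+(?:\.\d+)?):\s+(.*)$")

# Substrings treated as link trouble in this sketch (taken from the log above).
KEYWORDS = (
    "SError",
    "limiting SATA link speed",
    "hard resetting link",
    "SATA link down",
)


def link_events(dmesg_text):
    """Return a list of (timestamp_seconds, port, message) for link-related lines."""
    events = []
    for line in dmesg_text.splitlines():
        match = LINE.match(line.strip())
        if match and any(key in match.group(3) for key in KEYWORDS):
            events.append((float(match.group(1)), match.group(2), match.group(3)))
    return events


# Lines copied from the dmesg excerpt in this report.
SAMPLE = """\
[ 187.475938] ata2: irq_stat 0x00000040, connection status changed
[ 187.475975] ata2: SError: { CommWake 10B8B DevExch }
[ 187.475991] ata2: limiting SATA link speed to 1.5 Gbps
[ 187.476000] ata2: hard resetting link
[ 193.821030] ata2: link is slow to respond, please be patient (ready=0)
[ 195.709212] ata2: SATA link down (SStatus 0 SControl 310)
[ 195.709239] ata2: EH complete
"""

if __name__ == "__main__":
    for timestamp, port, message in link_events(SAMPLE):
        print(f"{timestamp:12.6f} {port} {message}")
```

Feeding it the full `dmesg` output instead of `SAMPLE` would list every reset and link-down event per port, which can then be compared against the times the drive changed device name or went offline.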