Bug 1084928

Summary: ata1.00: failed command: READ FPDMA QUEUED without libata.force=noncq on SAMSUNG MZHPU128HCGM PCIe SSD disk
Product: [Fedora] Fedora
Component: kernel
Version: 20
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Dominik 'Rathann' Mierzejewski <dominik>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, marc, mchehab, sumitrai96
Fixed In Version: kernel-3.17.7-300.fc21
Doc Type: Bug Fix
Type: Bug
Last Closed: 2014-12-21 06:36:16 UTC
Attachments:
/run/initramfs/rdsosreport.log from SystemD emergency shell (flags: none)

Description Dominik 'Rathann' Mierzejewski 2014-04-07 10:09:01 UTC
Created attachment 883515
/run/initramfs/rdsosreport.log from SystemD emergency shell

Description of problem:
System fails to boot due to disk read errors.

Version-Release number of selected component (if applicable):
kernel-3.13.9-200.fc20.x86_64

How reproducible:
Always.

Steps to Reproduce:
1. Try booting any F20 kernel

Actual results:
Repeated errors like below:
[   31.747119] sakura kernel: ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[   31.747163] sakura kernel: ata1.00: failed command: READ FPDMA QUEUED
[   31.747193] sakura kernel: ata1.00: cmd 60/08:00:38:08:00/00:00:00:00:00/40 tag 0 ncq 4096 in
                                                      res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   31.747257] sakura kernel: ata1.00: status: { DRDY }
[   31.747277] sakura kernel: ata1.00: failed command: READ FPDMA QUEUED
[   31.747305] sakura kernel: ata1.00: cmd 60/08:08:38:48:06/00:00:00:00:00/40 tag 1 ncq 4096 in
                                                      res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   31.747369] sakura kernel: ata1.00: status: { DRDY }
[   31.747390] sakura kernel: ata1: hard resetting link
[   32.054328] sakura kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   32.059747] sakura kernel: ata1.00: configured for UDMA/133
[   32.059753] sakura kernel: ata1.00: device reported invalid CHS sector 0
[   32.059755] sakura kernel: ata1.00: device reported invalid CHS sector 0
[   32.059763] sakura kernel: ata1: EH complete

SystemD eventually drops out to emergency shell.
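
For readers of the log above: SAct is the SActive bitmask, with one bit per outstanding NCQ tag, so "SAct 0x3" corresponds to the two failed commands with tag 0 and tag 1. A minimal sketch of that decoding follows; sact_tags is a hypothetical helper name used only for illustration, not an existing tool.

```shell
#!/bin/sh
# sact_tags: hypothetical helper (not part of libata or any distro package).
# Prints the NCQ tag numbers whose bits are set in an SAct value.
sact_tags() {
    mask=$(( $1 ))          # shell arithmetic accepts hex such as 0x3
    tag=0
    out=""
    while [ "$tag" -lt 32 ]; do     # NCQ allows at most 32 tags (0..31)
        if [ $(( (mask >> tag) & 1 )) -eq 1 ]; then
            out="${out:+$out }$tag"
        fi
        tag=$(( tag + 1 ))
    done
    echo "$out"
}

sact_tags 0x3    # prints: 0 1
```

The same helper applied to a different SAct value lists whichever tags were in flight when the port froze.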

Expected results:
Normal boot.

Additional info:
A widely reported workaround, adding libata.force=noncq to the kernel command line, allows the system to boot and function normally.
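
A minimal sketch of how one might check whether the workaround is active, and make it persistent, is shown below. has_noncq is a hypothetical helper name and the sample command line (including the root= value) is made up for illustration. The grubby invocation is left as a comment so the sketch has no side effects; it assumes a Fedora system where grubby manages the boot entries.

```shell
#!/bin/sh
# has_noncq: hypothetical helper. Succeeds if the kernel command line given
# as $1 carries a libata.force= option whose comma-separated values include
# "noncq", either bare or with a port/device prefix such as "1.00:noncq".
has_noncq() {
    for arg in $1; do
        case "$arg" in
            libata.force=*)
                vals=${arg#libata.force=}
                old_ifs=$IFS
                IFS=,
                for v in $vals; do
                    case "$v" in
                        noncq|*:noncq) IFS=$old_ifs; return 0 ;;
                    esac
                done
                IFS=$old_ifs
                ;;
        esac
    done
    return 1
}

# Example with a literal command line (illustrative values only).
if has_noncq "ro root=/dev/sda2 libata.force=noncq rhgb quiet"; then
    echo "NCQ workaround active"
else
    echo "NCQ workaround not active"
fi

# On a live system, pass the real command line instead:
#   has_noncq "$(cat /proc/cmdline)"
# To add the option to every installed kernel entry (run as root):
#   grubby --update-kernel=ALL --args="libata.force=noncq"
```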

The machine is a Sony Vaio Pro 13 with Samsung XP941 SSD.

lspci -vn
[...]
03:00.0 0106: 144d:a800 (rev 01) (prog-if 01 [AHCI 1.0])
	Subsystem: 144d:a811
	Flags: bus master, fast devsel, latency 0, IRQ 56
	Memory at f6010000 (32-bit, non-prefetchable) [size=8K]
	Expansion ROM at f6000000 [disabled] [size=64K]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable+ Count=1/2 Maskable+ 64bit+
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [d0] Vital Product Data
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [150] Power Budgeting <?>
	Capabilities: [160] Latency Tolerance Reporting
	Kernel driver in use: ahci

$ sudo hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
	Model Number:       SAMSUNG MZHPU128HCGM-00000              
	Serial Number:      xxxxxxxxxxx60
	Firmware Revision:  UXM6401Q
	Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
	Used: unknown (minor revision code 0x0039) 
	Supported: 9 8 7 6 5 
	Likely used: 9
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  250069680
	LBA48  user addressable sectors:  250069680
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      122104 MBytes
	device size with M = 1000*1000:      128035 MBytes (128 GB)
	cache/buffer size  = unknown
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	WRITE_{DMA|MULTIPLE}_FUA_EXT
	   *	64-bit World wide name
	    	Write-Read-Verify feature set
	   *	WRITE_UNCORRECTABLE_EXT command
	   *	{READ,WRITE}_DMA_EXT_GPL commands
	   *	Segmented DOWNLOAD_MICROCODE
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Gen3 signaling speed (6.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	unknown 76[15]
	    	DMA Setup Auto-Activate optimization
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	SET MAX SETPASSWORD/UNLOCK DMA commands
	   *	WRITE BUFFER DMA command
	   *	READ BUFFER DMA command
	   *	Data Set Management TRIM supported (limit 8 blocks)
Security: 
	Master password revision code = 65534
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
		supported: enhanced erase
	6min for SECURITY ERASE UNIT. 32min for ENHANCED SECURITY ERASE UNIT. 
Logical Unit WWN Device Identifier: 5002538xxxxxxxxx
	NAA		: 5
	IEEE OUI	: 002538
	Unique ID	: xxxxxxxxx
Integrity word not set (found 0x27ef, expected 0x100a5)
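
The hdparm output above shows that the drive advertises NCQ with a queue depth of 32. As a rough sketch, the advertised depth can be pulled out of that output and compared with the depth the kernel is actually using. advertised_queue_depth is a hypothetical helper name; the sysfs path in the comments assumes the disk is /dev/sda, and the expectation that it reads 1 when NCQ is off is an assumption about typical libata behaviour, not something shown in this report.

```shell
#!/bin/sh
# advertised_queue_depth: hypothetical helper. Reads `hdparm -I` output on
# stdin and prints the number from its "Queue depth:" line.
advertised_queue_depth() {
    sed -n 's/^[[:space:]]*Queue depth:[[:space:]]*\([0-9][0-9]*\).*$/\1/p'
}

# Example using two lines copied from the output above.
printf '\tLBA, IORDY(can be disabled)\n\tQueue depth: 32\n' | advertised_queue_depth
# prints: 32

# On a live system:
#   sudo hdparm -I /dev/sda | advertised_queue_depth
# The depth currently used by the kernel (typically 1 when NCQ is disabled):
#   cat /sys/block/sda/device/queue_depth
```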

Comment 1 Justin M. Forbes 2014-05-21 19:40:08 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.14.4-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 2 Dominik 'Rathann' Mierzejewski 2014-06-04 08:31:14 UTC
Yes, the issue still persists with kernel-3.14.2-200.fc20.x86_64.

Comment 3 Justin M. Forbes 2014-11-13 16:02:21 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.17.2-200.fc20.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 21, and are still experiencing this issue, please change the version to Fedora 21.

If you experience different issues, please open a new bug report for those.

Comment 4 Dominik 'Rathann' Mierzejewski 2014-12-01 16:07:57 UTC
Confirming this is still happening with kernel-3.17.4-200.fc20.x86_64, but this kernel recovers gracefully by disabling NCQ automatically:

Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: NCQ disabled due to excessive errors
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: exception Emask 0x0 SAct 0x300 SErr 0x0 action 0x6 frozen
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: failed command: READ FPDMA QUEUED
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: cmd 60/18:40:20:00:16/00:00:00:00:00/40 tag 8 ncq 12288 in
                                                       res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: status: { DRDY }
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: failed command: READ FPDMA QUEUED
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: cmd 60/08:48:10:00:16/00:00:00:00:00/40 tag 9 ncq 4096 in
                                                       res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: status: { DRDY }
Dec 01 16:54:26 sakura.greysector.net kernel: ata1: hard resetting link
Dec 01 16:54:27 sakura.greysector.net kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 01 16:54:27 sakura.greysector.net kernel: ata1.00: configured for UDMA/133
Dec 01 16:54:27 sakura.greysector.net kernel: ata1.00: device reported invalid CHS sector 0
Dec 01 16:54:27 sakura.greysector.net kernel: ata1.00: device reported invalid CHS sector 0
Dec 01 16:54:27 sakura.greysector.net kernel: ata1: EH complete
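
To check whether a given boot hit this automatic fallback, one could filter the kernel log for the message quoted above. This is only a sketch: ncq_disabled_devices is a hypothetical helper name, and it assumes the log lines contain "kernel: " before the ataN.NN prefix, as they do in both excerpts in this report.

```shell
#!/bin/sh
# ncq_disabled_devices: hypothetical helper. Reads kernel log lines on stdin
# and prints each ATA device for which libata reported that it disabled NCQ.
ncq_disabled_devices() {
    sed -n 's/^.*kernel: \(ata[0-9][0-9]*\.[0-9][0-9]*\): NCQ disabled due to excessive errors.*$/\1/p' | sort -u
}

# Example using two lines copied from the log above.
printf '%s\n' \
    'Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: NCQ disabled due to excessive errors' \
    'Dec 01 16:54:26 sakura.greysector.net kernel: ata1.00: failed command: READ FPDMA QUEUED' \
    | ncq_disabled_devices
# prints: ata1.00

# On a live system with systemd:
#   journalctl -k | ncq_disabled_devices
```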

Comment 5 Dominik 'Rathann' Mierzejewski 2014-12-04 18:19:42 UTC
Looks like the patch from https://bugzilla.kernel.org/show_bug.cgi?id=89171#c8 fixes this issue. Please include it in the Fedora kernel package while it's making its way to the main tree.

Comment 6 Dominik 'Rathann' Mierzejewski 2014-12-08 09:10:22 UTC
FYI this is now part of 3.17-stable queue:

https://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git/commit/?id=211d32be66b621940515b45ddf60865dcda246b8

Comment 7 Josh Boyer 2014-12-10 19:31:46 UTC
(In reply to Dominik 'Rathann' Mierzejewski from comment #6)
> FYI this is now part of 3.17-stable queue:
> 
> https://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git/commit/?id=211d32be66b621940515b45ddf60865dcda246b8

Thanks again for the pointer.  I've added it to Fedora git today and it will be in the next build of each.

Comment 8 Fedora Update System 2014-12-17 19:01:59 UTC
kernel-3.17.7-300.fc21 has been submitted as an update for Fedora 21.
https://admin.fedoraproject.org/updates/kernel-3.17.7-300.fc21

Comment 9 Fedora Update System 2014-12-17 19:03:54 UTC
kernel-3.17.7-200.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.17.7-200.fc20

Comment 10 Fedora Update System 2014-12-19 18:31:22 UTC
Package kernel-3.17.7-200.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.17.7-200.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-17283/kernel-3.17.7-200.fc20
then log in and leave karma (feedback).

Comment 11 Fedora Update System 2014-12-21 06:36:16 UTC
kernel-3.17.7-200.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 12 Fedora Update System 2014-12-22 02:32:30 UTC
kernel-3.17.7-300.fc21 has been pushed to the Fedora 21 stable repository.  If problems still persist, please make note of it in this bug report.
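
After updating, a quick way to confirm that the running kernel is at least the fixed version is to compare version numbers. The sketch below is a simplification: version_ge is a hypothetical helper name, it compares only the upstream part of the version (everything before the first "-") and assumes purely numeric components. This report confirms the fix only for the Fedora builds kernel-3.17.7-200.fc20 and kernel-3.17.7-300.fc21, so treating any 3.17.7 or later kernel as fixed is an assumption.

```shell
#!/bin/sh
# version_ge: hypothetical helper. Succeeds if the upstream version in $1 is
# greater than or equal to the one in $2. The release suffix after the first
# "-" (for example "-200.fc20.x86_64") is ignored.
version_ge() {
    a=${1%%-*}
    b=${2%%-*}
    while [ -n "$a" ] || [ -n "$b" ]; do
        x=${a%%.*}
        y=${b%%.*}
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
        # Drop the component just compared from each string.
        case "$a" in *.*) a=${a#*.} ;; *) a="" ;; esac
        case "$b" in *.*) b=${b#*.} ;; *) b="" ;; esac
    done
    return 0
}

# Example with the kernel from comment 4, which predates the fix.
if version_ge "3.17.4-200.fc20.x86_64" "3.17.7"; then
    echo "kernel is 3.17.7 or newer"
else
    echo "kernel is older than 3.17.7"
fi

# On a live system:
#   version_ge "$(uname -r)" 3.17.7 && echo "kernel is 3.17.7 or newer"
```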