Bug 844348

Summary: Some (older) SSDs are slower with rotational=0 flag set
Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Milan Broz <mbroz>
Assignee: Jeff Moyer <jmoyer>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: jmoyer, pvrabec, rwheeler
Type: Bug
Doc Type: Bug Fix
Last Closed: 2012-08-06 17:00:08 UTC

Description Milan Broz 2012-07-30 11:53:46 UTC
Description of problem:

When running a performance test over some SSDs, I noticed that some drives
are slower when set to non-rotational mode.

Strangely, this happens only if a specific performance test had been run
beforehand (over a mapped device).

Here is the output from the following command (all runs with caches flushed, etc.):
dd if=<device> of=/dev/null bs=1M count=100 

Rotational flag set to 0, deadline, read-ahead = 256:
/dev/sdc: 104857600 bytes (105 MB) copied, 1.14373 s, 91.7 MB/s


Rotational flag set to *1*, deadline, read-ahead = 256:
/dev/sdc: 104857600 bytes (105 MB) copied, 0.411118 s, 255 MB/s
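
For reference, a minimal sketch of the comparison above (this is not the
attached reproducer script; /dev/sdc and the settings are taken from the
runs shown, and per the description the slowdown only appears after a prior
performance test over the mapped device):

# flush caches so both runs start cold
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs /dev/sdc

# run 1: non-rotational, deadline, read-ahead 256 sectors (128KB)
echo deadline > /sys/block/sdc/queue/scheduler
blockdev --setra 256 /dev/sdc
echo 0 > /sys/block/sdc/queue/rotational
dd if=/dev/sdc of=/dev/null bs=1M count=100

# run 2: identical settings, but rotational=1
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /sys/block/sdc/queue/rotational
dd if=/dev/sdc of=/dev/null bs=1M count=100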


Blktrace shows much longer output; it looks as if the I/Os are being split
into too many pieces...

This problem is not present with 3.5.0 on the same machine and disk.

(I will attach a reproducer script and logs in the next comment.)

Version-Release number of selected component (if applicable):
2.6.32-279.4.1.el6.x86_64

Comment 2 Jeff Moyer 2012-08-06 17:00:08 UTC
This is functioning as designed.  What happens is this:

Using the CFQ I/O scheduler, you run a sequential read workload, with readahead set to 128KB (it doesn't matter whether the rotational flag is set to 0 or 1).  During this run, the device queue depth is driven beyond 4, at which point the kernel marks the device with the QUEUE_FLAG_CQ flag.  When this flag is set, it affects whether and how long the device queue remains plugged (see queue_should_plug).  Basically, if the device is non-rotational and supports command queuing, we go ahead and send requests sooner rather than later, under the assumption that newer SSDs will have no problem driving high IOPS.

The deadline I/O scheduler doesn't drive a queue depth of more than 2 for this particular workload, so you never actually set the QUEUE_FLAG_CQ flag.  Because of that, only read-ahead sized I/Os make it to disk, and you have better throughput.
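
One way to observe the behavior described above (illustrative only; avgrq-sz
is reported in 512-byte sectors and avgqu-sz is the average queue depth):

# watch request size and queue depth on the device while the dd runs;
# ~256-sector requests mean read-ahead-sized I/Os are reaching the disk
iostat -dx 1 /dev/sdc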

In general, your test workload is poor.  Buffered I/O to the block device is not a path that we tune for performance.  Also, a single-threaded read is fairly simplistic, especially when taken as the lone data point.

So, I'm closing this bugzilla as NOTABUG.

Comment 3 Milan Broz 2012-08-06 18:48:32 UTC
If it is functioning as designed, why does it work better in the upstream kernel? :-)

Anyway, I do not really care. This bug was discovered while testing a more complex problem upstream.

Comment 4 Jeff Moyer 2012-08-06 19:57:30 UTC
(In reply to comment #3)
> If it is functioning as designed, why does it work better in the upstream
> kernel? :-)

The on-stack plugging patches introduced plugging where there previously was none (a quick blktrace run on 3.5.0 confirms this).  In other words, I/Os are queued up before they even reach the I/O scheduler, so by the time they get there they are "complete."  Your 128KB (256-sector) read-ahead I/O therefore arrives in one chunk, instead of as a bunch of Queue/Merge events.  The on-stack plugging is pretty invasive, so I don't think it's feasible to backport it to RHEL 6.
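
For reference, the quick check mentioned above might look something like this
(illustrative; the device name and events are taken from this report):

# trace the device while the dd runs and compare the event streams
blktrace -d /dev/sdc -o - | blkparse -i - > dd-trace.txt
# per the explanation above: on 2.6.32 the read-ahead shows up as a Queue event
# followed by a string of Merge events; on 3.5, with on-stack plugging, it
# reaches the I/O scheduler in one chunk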

> Anyway, I do not really care. This bug was discovered while testing a more
> complex problem upstream.

If this were a more realistic workload, I'd put more time into it.  I just don't think anyone cares about a single dd to the block device (and if you do care, you can tune the system for this workload).
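
For completeness, the kind of per-device tuning hinted at here, using only
what the report itself shows (illustrative):

# the description's own numbers show that flipping the flag back restores
# the ~255 MB/s result for this single-threaded buffered dd
echo 1 > /sys/block/sdc/queue/rotational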