Bug 844348 - Some (older) SSDs are slower with rotational=0 flag set
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.4
Severity: unspecified
Target Milestone: rc
Assigned To: Jeff Moyer
QA Contact: Red Hat Kernel QE team
Reported: 2012-07-30 07:53 EDT by Milan Broz
Modified: 2013-02-28 23:11 EST
CC List: 3 users

Doc Type: Bug Fix
Last Closed: 2012-08-06 13:00:08 EDT
Type: Bug
Attachments: None

Description Milan Broz 2012-07-30 07:53:46 EDT
Description of problem:

While running performance tests over some SSDs, I noticed that some drives
are slower when set to non-rotational mode.

Strangely, this happens only if a particular performance test was run
beforehand (over a mapped device).

Here is the output of the following command (caches flushed before each run):
dd if=<device> of=/dev/null bs=1M count=100 

Rotational flag set to 0, deadline, read-ahead = 256:
/dev/sdc: 104857600 bytes (105 MB) copied, 1.14373 s, 91.7 MB/s


Rotational flag set to *1*, deadline, read-ahead = 256:
/dev/sdc: 104857600 bytes (105 MB) copied, 0.411118 s, 255 MB/s
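
For reference, this is roughly the setup behind the two runs above. A minimal
sketch only, not the actual reproducer script (device name, sysfs paths and the
cache-flush step are my assumptions here):

echo deadline > /sys/block/sdc/queue/scheduler   # I/O scheduler used above
blockdev --setra 256 /dev/sdc                    # read-ahead = 256 sectors (128 kB)
echo 0 > /sys/block/sdc/queue/rotational         # 0 = non-rotational, 1 = rotational
sync; echo 3 > /proc/sys/vm/drop_caches          # flush page cache before each run
dd if=/dev/sdc of=/dev/null bs=1M count=100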


Blktrace shows much longer output; it seems that the I/Os are being split
into too many pieces...

This problem is not present with 3.5.0 on the same machine and disk.

(I will attach reproducer script and logs in next comment.)

Version-Release number of selected component (if applicable):
2.6.32-279.4.1.el6.x86_64
Comment 2 Jeff Moyer 2012-08-06 13:00:08 EDT
This is functioning as designed.  What happens is this:

Using the CFQ I/O scheduler, you run a sequential read workload, with readahead set to 128KB (it doesn't matter whether the rotational flag is set to 0 or 1).  During this run, the device queue depth is driven beyond 4, at which point the kernel marks the device with the QUEUE_FLAG_CQ flag.  When this flag is set, it affects whether and how long the device queue remains plugged (see queue_should_plug).  Basically, if the device is non-rotational and supports command queuing, we go ahead and send requests sooner rather than later, under the assumption that newer SSDs will have no problem driving high IOPS.
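
A rough way to see this effect from user space (device name assumed; just a sketch) is to trace the dd run under both rotational settings and compare how the I/O actually reaches the driver:

blktrace -d /dev/sdc -o - | blkparse -i - > trace.txt   # capture while the dd is running
grep -c ' D ' trace.txt   # requests issued to the driver: rotational=0 should show many
                          # more (and therefore smaller) issues for the same 100 MB read
grep -c ' P ' trace.txt   # plug events: with rotational=0 and the CQ flag set, the queue
                          # is plugged far less often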

The deadline I/O scheduler doesn't drive a queue depth of more than 2 for this particular workload, so you never actually set the QUEUE_FLAG_CQ flag.  Because of that, only read-ahead sized I/Os make it to disk, and you have better throughput.

In general, your test workload is poor.  Buffered I/O to the block device is not a path that we tune for performance.  Also, a single threaded read is fairly simplistic, especially when taken as the lone data point.

So, I'm closing this bugzilla as NOTABUG.
Comment 3 Milan Broz 2012-08-06 14:48:32 EDT
If it is functioning as designed, why is it working better in the upstream kernel? :-)

Whatever, I really do not care. This bug was discovered as part of testing a more complex problem upstream.
Comment 4 Jeff Moyer 2012-08-06 15:57:30 EDT
(In reply to comment #3)
> If it is functioning as designed, why is it working better in the upstream
> kernel? :-)

The on-stack plugging patches introduced plugging where there previously was none (a quick blktrace run on 3.5.0 confirms this).  In other words, I/Os are queued up before they even reach the I/O scheduler, so by the time they get there, they are "complete."  Your 256-sector read-ahead I/O therefore arrives in one chunk, instead of as a bunch of Queue/Merge events.  The on-stack plugging work is pretty invasive, so I don't think it's feasible to backport it to RHEL 6.
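
(If you want a quick way to tell whether a given kernel carries the on-stack plugging code at all: it provides blk_start_plug()/blk_finish_plug(), so checking for the symbol is a crude but workable test; a sketch, assuming /proc/kallsyms is available:)

grep -w blk_start_plug /proc/kallsyms   # present on kernels with on-stack plugging (e.g. 3.5.0), absent on 2.6.32-279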

> Whatever, I really do not care. This bug was discovered as part of testing a
> more complex problem upstream.

If this was a more realistic workload, I'd put more time into it.  I just don't think anyone cares about a single dd to the block device (and if you do care, you can tune the system for this workload).
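
(To be concrete about what "tune the system" could mean here, assuming the same device as in the report; a sketch only, not something I have benchmarked:)

echo 1 > /sys/block/sdc/queue/rotational   # go back to treating the device as rotational for plugging purposes
blockdev --setra 1024 /dev/sdc             # or raise read-ahead (may help by making each submitted I/O larger)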
