Bug 722716 - block layer issues separate flush request for dm-mp based raw device, leading to incorrect /proc/diskstats and iostat output for writes with the O_DSYNC open flag
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.1
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Mike Snitzer
QA Contact: Storage QE
URL:
Whiteboard:
Depends On:
Blocks: 782183 840683 961026 1056239
 
Reported: 2011-07-16 23:34 UTC by Dave Wysochanski
Modified: 2018-11-30 22:58 UTC
CC: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-02-25 18:16:30 UTC
Target Upstream Version:


Attachments
ftrace files with and without dsync (485.98 KB, application/x-compressed-tar)
2011-07-17 02:06 UTC, Dave Wysochanski
tarball containing multipath end-io 'dd' experiment with traces (4.08 KB, application/x-compressed-tar)
2011-07-19 22:48 UTC, Dave Wysochanski


Links
Red Hat Knowledge Base (Legacy) 22540
Red Hat Knowledge Base (Legacy) 59919

Description Dave Wysochanski 2011-07-16 23:34:59 UTC
Environment
* RHEL6
  * 2.6.32-131.4.1.el6.x86_64
* dm-multipath
* writing data with the O_DSYNC open flag

This is very easy to reproduce.

Incorrect iostat output is seen when running the 'iorate' (http://iorate.org/) benchmark, specifically
when issuing 64KB write IO to /dev/raw/raw1, which is mapped to /dev/dm-2, a dm-mp device with two
slow underlying NetApp iSCSI paths, as follows:
# dmsetup ls --tree | grep -A 3 '253:2'
mpathj (253:2)
 ├─ (8:16)
 └─ (8:0)
# multipath -ll mpathj
mpathj (360a980006e424b626d3464766a784f65) dm-2 NETAPP,LUN
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 2:0:0:0 sda 8:0    active ready running
  `- 3:0:0:0 sdb 8:16   active ready running

The "avgrq-sz" column for the dm-2 device is exactly half (64) what it should be (128), as seen
by the underlying device, sda.  My guess is, somewhere there is a double counting of IOs to the
dm device, since avgrq-sz is comes from 2 numbers in /proc/diskstats, total writes completed
(the 5th number after the device name) and total sectors written (the 7th number after the device
name).

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00   12.00     0.00  1536.00   128.00     1.00  153.42  83.33 100.00
sdb               0.00     0.00    0.00   12.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00   24.00     0.00  1536.00    64.00     1.00   76.71  41.71 100.10
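
For reference, avgrq-sz is just sectors transferred divided by requests completed over the sampling
interval: sda shows 1536 wsec/s / 12 w/s = 128 sectors (64KB), while dm-2's doubled write count gives
1536 / 24 = 64.  The same ratio can be pulled straight from the cumulative /proc/diskstats counters
(a quick sketch; these averages are since boot, so they are only indicative):

# field 3 = device name, field 8 = writes completed, field 10 = sectors written;
# print device, writes, sectors written, and average sectors per write
awk '$3 ~ /^(dm-2|sda|sdb)$/ { print $3, $8, $10, ($8 ? $10/$8 : 0) }' /proc/diskstats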

The open mode of iorate gives a significant clue.  An strace of iorate shows:
[pid 10312] open("/dev/raw/raw1", O_RDWR|O_DSYNC|O_LARGEFILE) = 9
[pid 10312] gettimeofday({1310856175, 929273}, NULL) = 0
[pid 10312] _llseek(9, 653271040, [653271040], SEEK_SET) = 0
[pid 10312] gettimeofday({1310856175, 930243}, NULL) = 0
[pid 10312] write(9, "**** Start of Block ****\t\366\322\202\244\0)8"..., 65536) = 65536
[pid 10312] gettimeofday({1310856175, 941630}, NULL) = 0

The bug is easily reproducible with a 'dd' and the 'dsync' flag:
# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync

open("/dev/raw/raw1", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC, 0666) = 3^M
dup2(3, 1)                              = 1^M
close(3)                                = 0^M
clock_gettime(CLOCK_MONOTONIC, {188081, 350861182}) = 0^M
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536^M
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536^M
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536^M
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536^M

This anomaly with avgrq-sz is not seen when using 'dd' to write to a raw device mapped to dm-mp
without the dsync flag:
# dd if=/dev/zero of=/dev/raw/raw1 bs=64k

It is also not seen when mapping the raw device directly to one of the underlying paths
(/dev/sda), and running:
# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync
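
For convenience, the whole check can be wrapped in a short bash script (a sketch of the same
comparison done by hand in comment 1 below; the device names dm-2/sda/sdb and /dev/raw/raw1 are
specific to this setup):

# snapshot the write/sector counters, do one 64KB O_DSYNC write, then diff the counters
before=$(egrep '(dm-2 |sda |sdb )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort)
dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync count=1
after=$(egrep '(dm-2 |sda |sdb )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort)
diff <(echo "$before") <(echo "$after")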

Comment 1 Dave Wysochanski 2011-07-16 23:41:35 UTC
Delving deeper into this, I think what is happening is that both paths are recording an IO completion when only one path actually sends the IO.

Note below output is parsed to just show:
<devicename> <# writes> <# sectors written>

# egrep '(dm-2 |sda |sdb )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-2 4 256
sda 9594 1207488
sdb 9445 205072
[root@rhel6 bin]# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync count=1
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.0112903 s, 5.8 MB/s
[root@rhel6 bin]# egrep '(dm-2 |sda |sdb )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-2 6 384
sda 9595 1207616
sdb 9446 205072


So the 'dd' command should have written a total of 128 sectors, and it should have been a single 64KB IO.

Instead, what we see is:
a) dm-2's write count goes from 4 to 6 (incorrect), while its sectors-written count goes from 256 to 384 (correct)
b) sda's write count goes up by 1 (correct), and its sectors-written count goes up by 128 (correct)
c) sdb's write count goes up by 1 (incorrect), while its sectors-written count stays the same.

So it appears the inactive path is getting counted as completing an IO when no IO has actually been sent to the device.

Comment 3 Dave Wysochanski 2011-07-16 23:50:03 UTC
It is important to know whether this is simply a counting problem when O_DSYNC is used, or whether it could be more extensive and potentially impact performance (such as unnecessary code being run in an IO completion handler for a path that has no IO).  We saw this at GS last week while testing dm-mp performance.

Comment 4 Dave Wysochanski 2011-07-17 02:01:09 UTC
I started to trace this a bit using ftrace.  It looks like when we are using
O_DSYNC with dd, we get a third call to blk_finish_request():
# grep "blk_finish_request <-blk_end_bidi_request" /tmp/trace-nodsync.txt 
          <idle>-0     [001] 615090.683330: blk_finish_request
<-blk_end_bidi_request
          <idle>-0     [001] 615090.683377: blk_finish_request
<-blk_end_bidi_request
[root@rhel6 ~]# grep "blk_finish_request <-blk_end_bidi_request"
/tmp/trace-dsync.txt 
          <idle>-0     [002] 613832.147387: blk_finish_request
<-blk_end_bidi_request
          <idle>-0     [002] 613832.147387: blk_finish_request
<-blk_end_bidi_request
              dd-11785 [002] 613832.147387: blk_finish_request
<-blk_end_bidi_request

Eventually blk_finish_request() calls our accounting function,
blk_account_io_done(), which mentions not accounting for "flush_rq", but maybe
there's a bug here?
static void blk_account_io_done(struct request *req)
{
        /*
         * Account IO completion.  flush_rq isn't accounted as a
         * normal IO on queueing nor completion.  Accounting the
         * containing request is enough.
         */
        if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
                unsigned long duration = jiffies - req->start_time;
                const int rw = rq_data_dir(req);
                struct hd_struct *part;
                int cpu;

                cpu = part_stat_lock();
                part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));

                part_stat_inc(cpu, part, ios[rw]);
                part_stat_add(cpu, part, ticks[rw], duration);
                part_round_stats(cpu, part);
                part_dec_in_flight(part, rw);

                part_stat_unlock();
        }
}


And if we look a bit deeper comparing the two traces, it certainly looks like
one of the last calls to part_round_stats() may be in question:
# grep "part_round_stats <-blk_finish_reques" /tmp/trace-*
/tmp/trace-dsync.txt:          <idle>-0     [002] 613832.147387:
part_round_stats <-blk_finish_request
/tmp/trace-dsync.txt:          <idle>-0     [002] 613832.147387:
part_round_stats <-blk_finish_request
/tmp/trace-dsync.txt:              dd-11785 [002] 613832.147387:
part_round_stats <-blk_finish_request
/tmp/trace-dsync.txt:              dd-11785 [002] 613832.147387:
part_round_stats <-blk_finish_request
/tmp/trace-nodsync.txt:          <idle>-0     [001] 615090.683333:
part_round_stats <-blk_finish_request
/tmp/trace-nodsync.txt:          <idle>-0     [001] 615090.683379:
part_round_stats <-blk_finish_request


I'll upload both of the traces to the bz.

Comment 5 Dave Wysochanski 2011-07-17 02:06:16 UTC
Created attachment 513505 [details]
ftrace files with and without dsync

I took these two traces with a "doit.sh" file:
echo function > /debug/tracing/current_tracer
echo 1 > /debug/tracing/tracing_enabled
dd if=/dev/zero of=/dev/raw/raw1 bs=64k count=1
echo 0 > /debug/tracing/tracing_enabled

The only thing I changed for the trace-dsync.txt run was adding "oflag=dsync" to the dd line.
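
(doit.sh as shown doesn't save the trace buffer; presumably the /tmp/trace-*.txt files grepped in
comment 4 were captured afterwards with something along these lines:)

# save the ftrace buffer for later grepping (hypothetical; the exact save step isn't shown above)
cat /debug/tracing/trace > /tmp/trace-dsync.txt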

Comment 6 Dave Wysochanski 2011-07-19 22:21:59 UTC
Ok, I'm fairly certain I know what's happening now.

Using systemtap, I have traced the 2 requests going to dm-multipath for the simple dd test above.  The 2 requests are:
1. Actual write request, which looks like this:
    15 dd(26682): -> multipath_map
req->cmd_type=0x1, req->cmd_flags=0x40201, req->cmd_len=16, req->__data_len=0x10000, req->__sector=0x0, req->bio=0xffff880012e64c00
and completes like this:
     0 swapper(0): -> multipath_end_io
req->cmd_type=0x1, req->cmd_flags=0x1040645, req->cmd_len=16, req->__data_len=0x0, req->__sector=0x0, req->bio=0x0
2. A separate request for the SYNC / FLUSH, which looks like this:
 12269 dd(26682): -> multipath_map
req->cmd_type=0x1, req->cmd_flags=0x4840201, req->cmd_len=16, req->__data_len=0x0, req->__sector=0xffffffffffffffff, req->bio=0x0
and completes:
     0 ksoftirqd/2(13): -> multipath_end_io
req->cmd_type=0x1, req->cmd_flags=0x1840245, req->cmd_len=16, req->__data_len=0x0, req->__sector=0xffffffffffffffff, req->bio=0x0
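
(A minimal sketch of such a probe, for anyone reproducing this: it assumes the request-based dm-mpath
entry points multipath_map()/multipath_end_io() take the cloned request as an argument named 'clone',
and that kernel debuginfo is installed; the actual script used is in the attachment in comment 7.)

stap -e '
  # hypothetical sketch: dump the request flags/length/sector seen by dm-mpath
  probe module("dm_multipath").function("multipath_map"),
        module("dm_multipath").function("multipath_end_io")
  {
    printf("%s(%d): -> %s cmd_flags=0x%x __data_len=0x%x __sector=0x%x\n",
           execname(), pid(), probefunc(),
           $clone->cmd_flags, $clone->__data_len, $clone->__sector)
  }'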

The application is not issuing a separate fsync or anything.  An strace of dd shows this is all dd does:
open("/dev/raw/raw1", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC, 0666) = 3^M
dup2(3, 1)                              = 1^M
close(3)                                = 0^M
clock_gettime(CLOCK_MONOTONIC, {445122, 530388475}) = 0^M
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536^M
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536^M
close(0)                                = 0^M
close(1)                                = 0^M

So it's unclear why it is necessary to issue that separate request, which goes down a separate path, throws off the iostat counts, and, my guess is, does impact performance in this case.  Why not just set the flags properly in the original write request?  Seems like a bug.

Comment 7 Dave Wysochanski 2011-07-19 22:48:38 UTC
Created attachment 513892 [details]
tarball containing multipath end-io 'dd' experiment with traces

Multipath end-io experiment on dd, with strace, systemtap script, and notes.txt file which explains the environment and steps taken to run the test and diagnose.

Comment 8 Dave Wysochanski 2011-07-19 23:01:26 UTC
I just checked RHEL5.6, and this does not occur.  So it is a regression in RHEL6:

# egrep '(dm-4 |sdg |sdh )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-4 1 128
sdg 0 0
sdh 1 128
[root@rhel5 ~]# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync count=1
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.011305 seconds, 5.8 MB/s
[root@rhel5 ~]# egrep '(dm-4 |sdg |sdh )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-4 2 256
sdg 0 0
sdh 2 256

Comment 9 Dave Wysochanski 2011-07-20 00:11:38 UTC
Another fairly simple stap script gives a better picture of these two requests and why they are split:


# stap blk-probe-writes.stp 
Pass 5: starting run.
device dm-2 sector 0 write, process dd[27538]
 0xffffffff81248ba0 : generic_make_request+0x0/0x5b0 [kernel]
 0xffffffff812491df : submit_bio+0x8f/0x120 [kernel]
 0xffffffff811ab28c : dio_bio_submit+0xbc/0xc0 [kernel]
 0xffffffff811abd93 : __blockdev_direct_IO+0x663/0xc40 [kernel]
 0xffffffff811a9a87 : blkdev_direct_IO+0x57/0x60 [kernel]
 0xffffffff8110de62 : generic_file_direct_write+0xc2/0x190 [kernel]
 0xffffffff8110f655 : __generic_file_aio_write+0x345/0x480 [kernel]
 0xffffffff811a91dc : blkdev_aio_write+0x3c/0xa0 [kernel]
 0xffffffff811722aa : do_sync_write+0xfa/0x140 [kernel]
 0xffffffff811725a8 : vfs_write+0xb8/0x1a0 [kernel]
 0xffffffff81172fe1 : sys_write+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
device dm-2 sector 0 write, process dd[27538]
 0xffffffff81248ba0 : generic_make_request+0x0/0x5b0 [kernel]
 0xffffffff812491df : submit_bio+0x8f/0x120 [kernel]
 0xffffffff8124c5c5 : __blkdev_issue_flush+0xa5/0xe0 [kernel]
 0xffffffff8124c616 : blkdev_issue_flush+0x16/0x20 [kernel]
 0xffffffff811a9cef : blkdev_fsync+0x3f/0x70 [kernel]
 0xffffffff811a0831 : vfs_fsync_range+0xa1/0xe0 [kernel]
 0xffffffff811a08bb : generic_write_sync+0x4b/0x50 [kernel]
 0xffffffff811a921e : blkdev_aio_write+0x7e/0xa0 [kernel]
 0xffffffff811722aa : do_sync_write+0xfa/0x140 [kernel]
 0xffffffff811725a8 : vfs_write+0xb8/0x1a0 [kernel]
 0xffffffff81172fe1 : sys_write+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
^C
[root@rhel6 stap]# cat blk-probe-writes.stp 
probe ioblock.request {
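    # restrict to whole-disk devices (bdev == bdev_contains, i.e. not a partition) and to
    # bios aimed at the very start of the device, which is where the 64KB dd write lands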
    if (bdev==bdev_contains && sector<4) {
	printf("device %s sector %d %s, process %s[%d]\n",
		devname, sector, (rw==BIO_READ ? "read":"write"), execname(), pid())
	print_backtrace()
    }
}

So both calls come from a more generic, higher-level routine in the kernel, blkdev_aio_write() in fs/block_dev.c:
/*
 * Write data to the block device.  Only intended for the block device itself
 * and the raw driver which basically is a fake block device.
 *
 * Does not take i_mutex for the write and thus is not for general purpose
 * use.
 */
ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
                         unsigned long nr_segs, loff_t pos)
{
        struct file *file = iocb->ki_filp;
        ssize_t ret;

        BUG_ON(iocb->ki_pos != pos);

        ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
        if (ret > 0 || ret == -EIOCBQUEUED) {
                ssize_t err;

                err = generic_write_sync(file, pos, ret);
                if (err < 0 && ret > 0)
                        ret = err;
        }
        return ret;
}

So we're calling __generic_file_aio_write(), then generic_write_sync().  This is identical to the upstream kernel (not surprising).

But digging a little deeper into mm/filemap.c:__generic_file_aio_write(), we see a comment that explains why we end up with the two requests, and suggests this might not be necessary for block devices:

/**
 * __generic_file_aio_write - write data to a file
 *
<snip>
 * This function does all the work needed for actually writing data to a
 * file. It does all basic checks, removes SUID from the file, updates
 * modification times and calls proper subroutines depending on whether we
 * do direct IO or a standard buffered write.
 *
 * It expects i_mutex to be grabbed unless we work on a block device or similar
 * object which does not need locking at all.
 *
 * This function does *not* take care of syncing data in case of O_SYNC write.
 * A caller has to handle it. This is mainly due to the fact that we want to
 * avoid syncing under i_mutex.
 */

OK, so the caller needs to sync the data.  But this is "mainly" due to wanting to avoid syncing under i_mutex, whereas earlier the comment says "It expects i_mutex to be grabbed unless we work on a block device or similar object which does not need locking at all".

I guess my next step is to look upstream, to better understand the history of how this got to where it is, and whether anything can be changed.

Comment 10 Dave Wysochanski 2011-07-20 17:01:14 UTC
Tracing the actual iSCSI wire shows that nothing is issued across the wire for the 'sync' request, so it seems that, at least in my setup, the extra request is pure overhead.  I'm not sure whether it matters for sync vs. async IO as far as the process behavior is concerned.  The separation of the sync and the write into separate requests does seem to be long-standing standard Linux behavior; the counting of the requests is another story.

Comment 11 Dave Wysochanski 2011-07-22 11:47:58 UTC
Here's an article which discusses POSIX O_SYNC and O_DSYNC, and some of the history in Linux:
http://lwn.net/Articles/352229/

Comment 16 Dave Wysochanski 2011-07-22 17:36:34 UTC
This appears to be purely a counting problem with iostat and /proc/diskstats.  It looks like a bug, and differs from RHEL5, but there does not seem to be any further problem beyond that.

Comment 22 RHEL Program Management 2012-05-03 04:47:13 UTC
Since RHEL 6.3 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 24 RHEL Program Management 2012-12-14 07:03:41 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 29 Alasdair Kergon 2013-10-09 22:35:19 UTC
If there's still something worth investigating here (skimming through the comments does not convince me there is), please gather fresh data - from a current upstream kernel, if that shows the same - and we'll start again from a fresh baseline.

Comment 30 RHEL Program Management 2014-02-25 18:16:30 UTC
Product Management has reviewed and declined this request.
You may appeal this decision by reopening this request.

