Environment
 * RHEL6
 * 2.6.32-131.4.1.el6.x86_64
 * dm-multipath
 * writing data with "DSYNC" open flag

This is very easy to reproduce. Incorrect iostat output is seen when running the 'iorate' (http://iorate.org/) benchmark, specifically when running 64KB write IO to /dev/raw/raw1, which is mapped to /dev/dm-2, a dm-mp device with 2 slow underlying netapp iSCSI paths as follows:

# dmsetup ls --tree | grep -A 3 '253:2'
mpathj (253:2)
 ├─ (8:16)
 └─ (8:0)

# multipath -ll mpathj
mpathj (360a980006e424b626d3464766a784f65) dm-2 NETAPP,LUN
size=1.0G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 2:0:0:0 sda 8:0  active ready running
  `- 3:0:0:0 sdb 8:16 active ready running

The "avgrq-sz" column for the dm-2 device is exactly half (64) of what it should be (128), as seen by the underlying device, sda. My guess is that somewhere there is a double counting of IOs to the dm device, since avgrq-sz comes from 2 numbers in /proc/diskstats: total writes completed (the 5th number after the device name) and total sectors written (the 7th number after the device name).

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm  %util
sda        0.00    0.00  0.00  12.00    0.00  1536.00    128.00      1.00  153.42  83.33 100.00
sdb        0.00    0.00  0.00  12.00    0.00     0.00      0.00      0.00    0.00   0.00   0.00
dm-2       0.00    0.00  0.00  24.00    0.00  1536.00     64.00      1.00   76.71  41.71 100.10

The open mode of iorate gives a significant clue. An strace of iorate shows:

[pid 10312] open("/dev/raw/raw1", O_RDWR|O_DSYNC|O_LARGEFILE) = 9
[pid 10312] gettimeofday({1310856175, 929273}, NULL) = 0
[pid 10312] _llseek(9, 653271040, [653271040], SEEK_SET) = 0
[pid 10312] gettimeofday({1310856175, 930243}, NULL) = 0
[pid 10312] write(9, "**** Start of Block ****\t\366\322\202\244\0)8"..., 65536) = 65536
[pid 10312] gettimeofday({1310856175, 941630}, NULL) = 0

The bug is easily reproducible with a 'dd' and the 'dsync' flag:

# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync

open("/dev/raw/raw1", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC, 0666) = 3
dup2(3, 1) = 1
close(3) = 0
clock_gettime(CLOCK_MONOTONIC, {188081, 350861182}) = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536

This anomaly with avgrq-sz is not seen when using 'dd' to a raw device mapped to dm-mp without the dsync flag:

# dd if=/dev/zero of=/dev/raw/raw1 bs=64k

It is also not seen when mapping the raw device directly to one of the underlying paths (/dev/sda) and running:

# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync
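To make the halving concrete, here is a minimal sketch (my own, not iostat source) of the arithmetic iostat performs with the two /proc/diskstats counters named above; the per-second values are taken from the iostat sample:

#include <stdio.h>

/* Minimal sketch of the avgrq-sz arithmetic (illustration only).
 * avgrq-sz = sectors transferred / IOs completed, in 512-byte sectors.
 * Values are the per-second figures from the iostat output above;
 * reads are zero in this workload. */
int main(void)
{
        double sda_ios = 12.0, sda_sectors = 1536.0;
        double dm2_ios = 24.0, dm2_sectors = 1536.0;

        printf("sda  avgrq-sz = %.2f\n", sda_sectors / sda_ios);   /* 128.00 */
        printf("dm-2 avgrq-sz = %.2f\n", dm2_sectors / dm2_ios);   /*  64.00 */
        /* Same sectors, twice the completed-IO count -> half the size. */
        return 0;
}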
Delving deeper into this, I think what is happening is that both paths are recording an IO when only one path actually sends the IO. Note: the output below is parsed to show just <devicename> <# writes> <# sectors written>.

# egrep '(dm-2 |sda |sdb )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-2 4 256
sda 9594 1207488
sdb 9445 205072

[root@rhel6 bin]# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync count=1
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.0112903 s, 5.8 MB/s

[root@rhel6 bin]# egrep '(dm-2 |sda |sdb )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-2 6 384
sda 9595 1207616
sdb 9446 205072

So the 'dd' command should have written a total of 128 sectors (64KB = 128 x 512-byte sectors) and should have shown up as a single 64KB IO. Instead, what we see is:

a) dm-2's write count goes from 4 to 6 (incorrect), and its sectors-written count goes from 256 to 384 (correct)
b) sda's write count goes up by 1 (correct), and its sectors-written count goes up by 128 (correct)
c) sdb's write count goes up by 1 (incorrect), and its sectors-written count stays the same.

So it appears the inactive path is getting counted as completing an IO when no IO has actually been sent to the device.
It is important to know whether this is just a simple counting problem when the DSYNC open flag is used, or whether it could be more extensive and potentially impact performance (for example, unnecessary code being run in an IO completion handler for a path that has no IO). We saw this at GS last week as we were testing dm-mp performance.
I started to trace this a bit using ftrace. It looks like when we are using DSYNC with dd, we get a 3rd call to blk_finish_request():

# grep "blk_finish_request <-blk_end_bidi_request" /tmp/trace-nodsync.txt
          <idle>-0     [001] 615090.683330: blk_finish_request <-blk_end_bidi_request
          <idle>-0     [001] 615090.683377: blk_finish_request <-blk_end_bidi_request
[root@rhel6 ~]# grep "blk_finish_request <-blk_end_bidi_request" /tmp/trace-dsync.txt
          <idle>-0     [002] 613832.147387: blk_finish_request <-blk_end_bidi_request
          <idle>-0     [002] 613832.147387: blk_finish_request <-blk_end_bidi_request
              dd-11785 [002] 613832.147387: blk_finish_request <-blk_end_bidi_request

Eventually blk_finish_request() calls our accounting function, blk_account_io_done(), whose comment mentions not accounting for "flush_rq", but maybe there's a bug here?

static void blk_account_io_done(struct request *req)
{
        /*
         * Account IO completion.  flush_rq isn't accounted as a
         * normal IO on queueing nor completion.  Accounting the
         * containing request is enough.
         */
        if (blk_do_io_stat(req) && req != &req->q->flush_rq) {
                unsigned long duration = jiffies - req->start_time;
                const int rw = rq_data_dir(req);
                struct hd_struct *part;
                int cpu;

                cpu = part_stat_lock();
                part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
                part_stat_inc(cpu, part, ios[rw]);
                part_stat_add(cpu, part, ticks[rw], duration);
                part_round_stats(cpu, part);
                part_dec_in_flight(part, rw);
                part_stat_unlock();
        }
}

And if we look a bit deeper, comparing the two traces, it certainly looks like one of the last calls to part_round_stats() may be in question:

# grep "part_round_stats <-blk_finish_reques" /tmp/trace-*
/tmp/trace-dsync.txt:   <idle>-0     [002] 613832.147387: part_round_stats <-blk_finish_request
/tmp/trace-dsync.txt:   <idle>-0     [002] 613832.147387: part_round_stats <-blk_finish_request
/tmp/trace-dsync.txt:       dd-11785 [002] 613832.147387: part_round_stats <-blk_finish_request
/tmp/trace-dsync.txt:       dd-11785 [002] 613832.147387: part_round_stats <-blk_finish_request
/tmp/trace-nodsync.txt: <idle>-0     [001] 615090.683333: part_round_stats <-blk_finish_request
/tmp/trace-nodsync.txt: <idle>-0     [001] 615090.683379: part_round_stats <-blk_finish_request

I'll upload both of the traces to the bz.
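Purely to illustrate where such a double count could be suppressed, here is a sketch of the same function with an extra guard that skips zero-length (flush-only) requests. This is a hypothesis of mine for discussion, not the RHEL6 code and not a proposed patch:

static void blk_account_io_done(struct request *req)
{
        /*
         * Sketch only: in addition to the flush_rq check, skip any
         * request that carries no data (blk_rq_sectors() == 0), on the
         * assumption that the extra completion being counted is the
         * empty DSYNC-triggered flush.
         */
        if (blk_do_io_stat(req) && req != &req->q->flush_rq &&
            blk_rq_sectors(req) > 0) {
                unsigned long duration = jiffies - req->start_time;
                const int rw = rq_data_dir(req);
                struct hd_struct *part;
                int cpu;

                cpu = part_stat_lock();
                part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
                part_stat_inc(cpu, part, ios[rw]);
                part_stat_add(cpu, part, ticks[rw], duration);
                part_round_stats(cpu, part);
                part_dec_in_flight(part, rw);
                part_stat_unlock();
        }
}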
Created attachment 513505 [details]
ftrace files

I took these two traces with a "doit.sh" file:

echo function > /debug/tracing/current_tracer
echo 1 > /debug/tracing/tracing_enabled
dd if=/dev/zero of=/dev/raw/raw1 bs=64k count=1
echo 0 > /debug/tracing/tracing_enabled

The only thing I changed was adding "oflag=dsync" to the dd line for the trace-dsync.txt run.
Ok, I'm fairly certain I know what's happening now. Using systemtap, I have traced the 2 requests going to dm-multipath for the simple dd test above. The 2 requests are:

1. The actual write request, which looks like this:

15 dd(26682): -> multipath_map req->cmd_type=0x1, req->cmd_flags=0x40201, req->cmd_len=16, req->__data_len=0x10000, req->__sector=0x0, req->bio=0xffff880012e64c00

and completes like this:

0 swapper(0): -> multipath_end_io req->cmd_type=0x1, req->cmd_flags=0x1040645, req->cmd_len=16, req->__data_len=0x0, req->__sector=0x0, req->bio=0x0

2. A separate request for the SYNC/FLUSH, which looks like this:

12269 dd(26682): -> multipath_map req->cmd_type=0x1, req->cmd_flags=0x4840201, req->cmd_len=16, req->__data_len=0x0, req->__sector=0xffffffffffffffff, req->bio=0x0

and completes:

0 ksoftirqd/2(13): -> multipath_end_io req->cmd_type=0x1, req->cmd_flags=0x1840245, req->cmd_len=16, req->__data_len=0x0, req->__sector=0xffffffffffffffff, req->bio=0x0

The application is not issuing a separate fsync or anything. An strace of dd shows this is all dd does:

open("/dev/raw/raw1", O_WRONLY|O_CREAT|O_TRUNC|O_DSYNC, 0666) = 3
dup2(3, 1) = 1
close(3) = 0
clock_gettime(CLOCK_MONOTONIC, {445122, 530388475}) = 0
read(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
close(0) = 0
close(1) = 0

So it's unclear why it is necessary to issue that separate request, which goes down a separate path, throws off the iostat counts, and, my guess is, impacts performance in this case. Why not just set the flags properly in the original write request? Seems like a bug.
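For completeness, here is a minimal C equivalent of the dd reproducer (a sketch of my own; the device path is from this setup, the buffer is aligned because the raw driver requires it, and writing to the device is destructive):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        const size_t len = 65536;   /* one 64KB write, like dd bs=64k count=1 */

        /* Raw devices need sector-aligned buffers; align to 4KB to be safe. */
        if (posix_memalign(&buf, 4096, len) != 0)
                return 1;
        memset(buf, 0, len);

        int fd = open("/dev/raw/raw1", O_WRONLY | O_DSYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* A single write(); with O_DSYNC the kernel issues the data
         * request plus a separate, empty flush request, which is what
         * shows up as the extra completion in /proc/diskstats. */
        if (write(fd, buf, len) != (ssize_t)len) {
                perror("write");
                return 1;
        }
        close(fd);
        free(buf);
        return 0;
}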
Created attachment 513892 [details]
tarball containing multipath end-io 'dd' experiment with traces

Multipath end-io experiment on dd, with strace output, the systemtap script, and a notes.txt file which explains the environment and the steps taken to run the test and diagnose it.
I just checked RHEL5.6, and this does not occur. So it is a regression in RHEL6:

# egrep '(dm-4 |sdg |sdh )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-4 1 128
sdg 0 0
sdh 1 128

[root@rhel5 ~]# dd if=/dev/zero of=/dev/raw/raw1 bs=64k oflag=dsync count=1
1+0 records in
1+0 records out
65536 bytes (66 kB) copied, 0.011305 seconds, 5.8 MB/s

[root@rhel5 ~]# egrep '(dm-4 |sdg |sdh )' /proc/diskstats | awk '{ print $3,$8,$10 }' | sort
dm-4 2 256
sdg 0 0
sdh 2 256
Another fairly simple stap, and we get a better picture of these two requests and why they are split:

# stap blk-probe-writes.stp
Pass 5: starting run.
device dm-2 sector 0 write, process dd[27538]
 0xffffffff81248ba0 : generic_make_request+0x0/0x5b0 [kernel]
 0xffffffff812491df : submit_bio+0x8f/0x120 [kernel]
 0xffffffff811ab28c : dio_bio_submit+0xbc/0xc0 [kernel]
 0xffffffff811abd93 : __blockdev_direct_IO+0x663/0xc40 [kernel]
 0xffffffff811a9a87 : blkdev_direct_IO+0x57/0x60 [kernel]
 0xffffffff8110de62 : generic_file_direct_write+0xc2/0x190 [kernel]
 0xffffffff8110f655 : __generic_file_aio_write+0x345/0x480 [kernel]
 0xffffffff811a91dc : blkdev_aio_write+0x3c/0xa0 [kernel]
 0xffffffff811722aa : do_sync_write+0xfa/0x140 [kernel]
 0xffffffff811725a8 : vfs_write+0xb8/0x1a0 [kernel]
 0xffffffff81172fe1 : sys_write+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
device dm-2 sector 0 write, process dd[27538]
 0xffffffff81248ba0 : generic_make_request+0x0/0x5b0 [kernel]
 0xffffffff812491df : submit_bio+0x8f/0x120 [kernel]
 0xffffffff8124c5c5 : __blkdev_issue_flush+0xa5/0xe0 [kernel]
 0xffffffff8124c616 : blkdev_issue_flush+0x16/0x20 [kernel]
 0xffffffff811a9cef : blkdev_fsync+0x3f/0x70 [kernel]
 0xffffffff811a0831 : vfs_fsync_range+0xa1/0xe0 [kernel]
 0xffffffff811a08bb : generic_write_sync+0x4b/0x50 [kernel]
 0xffffffff811a921e : blkdev_aio_write+0x7e/0xa0 [kernel]
 0xffffffff811722aa : do_sync_write+0xfa/0x140 [kernel]
 0xffffffff811725a8 : vfs_write+0xb8/0x1a0 [kernel]
 0xffffffff81172fe1 : sys_write+0x51/0x90 [kernel]
 0xffffffff8100b172 : system_call_fastpath+0x16/0x1b [kernel]
^C

[root@rhel6 stap]# cat blk-probe-writes.stp
probe ioblock.request {
  if (bdev==bdev_contains && sector<4) {
    printf("device %s sector %d %s, process %s[%d]\n",
           devname, sector, (rw==BIO_READ ? "read":"write"),
           execname(), pid())
    print_backtrace()
  }
}

So both calls come from a more generic, higher level routine in the kernel, fs/block_dev.c:blkdev_aio_write():

/*
 * Write data to the block device.  Only intended for the block device itself
 * and the raw driver which basically is a fake block device.
 *
 * Does not take i_mutex for the write and thus is not for general purpose
 * use.
 */
ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
                         unsigned long nr_segs, loff_t pos)
{
        struct file *file = iocb->ki_filp;
        ssize_t ret;

        BUG_ON(iocb->ki_pos != pos);

        ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
        if (ret > 0 || ret == -EIOCBQUEUED) {
                ssize_t err;

                err = generic_write_sync(file, pos, ret);
                if (err < 0 && ret > 0)
                        ret = err;
        }
        return ret;
}

So we're calling __generic_file_aio_write(), then generic_write_sync(). This is identical to the upstream kernel (not surprising). But digging a little deeper into mm/filemap.c:__generic_file_aio_write(), we see something that indicates the reason we have the two requests, and also that this might not be necessary for block devices:

/**
 * __generic_file_aio_write - write data to a file
 * <snip>
 * This function does all the work needed for actually writing data to a
 * file. It does all basic checks, removes SUID from the file, updates
 * modification times and calls proper subroutines depending on whether we
 * do direct IO or a standard buffered write.
 *
 * It expects i_mutex to be grabbed unless we work on a block device or similar
 * object which does not need locking at all.
 *
 * This function does *not* take care of syncing data in case of O_SYNC write.
 * A caller has to handle it. This is mainly due to the fact that we want to
 * avoid syncing under i_mutex.
 */
OK, so the caller needs to sync the data, and this is "mainly" because we want to avoid syncing under i_mutex. But earlier the comment says "It expects i_mutex to be grabbed unless we work on a block device or similar object which does not need locking at all", so for a block device that justification does not obviously apply. I guess my next step is to look upstream, to better understand the history of how this got where it is, and whether anything can be changed.
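For reference, the split is decided inside generic_write_sync(); roughly (a paraphrase from my reading of the 2.6.32-era code, not a verbatim copy of the RHEL6 source), the post-write flush only happens for O_SYNC/O_DSYNC writers or sync inodes, which matches the observation that plain (non-dsync) dd does not show the anomaly:

/* Paraphrase of the generic_write_sync() decision (names and signature
 * approximate, not the exact RHEL6 source). */
int generic_write_sync_sketch(struct file *file, loff_t pos, loff_t count)
{
        /* No O_SYNC/O_DSYNC and no sync inode: no extra flush request. */
        if (!(file->f_flags & O_DSYNC) && !IS_SYNC(file->f_mapping->host))
                return 0;

        /* O_DSYNC path: sync just the written range, which for a block
         * device ends up in blkdev_fsync() -> blkdev_issue_flush(), i.e.
         * the second, zero-length request seen in the traces above. */
        return vfs_fsync_range(file, file->f_path.dentry,
                               pos, pos + count - 1, 1 /* datasync */);
}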
Tracing the actual iSCSI wire shows that nothing is issued across the wire for the 'sync' request, so at least in my setup the extra request is pure overhead. I'm not sure whether it matters for sync vs. async IO semantics as far as the process is concerned. The separation of the sync and the write into separate requests does appear to be long-standing, standard Linux behavior; the counting of those requests is another story.
Here's an article that discusses POSIX O_SYNC and O_DSYNC, and some of their history in Linux: http://lwn.net/Articles/352229/
This appears to be purely a counting problem with iostat and /proc/diskstats. It still looks like a bug, and it differs from RHEL5, but there does not appear to be any problem beyond that.
Since RHEL 6.3 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.
If there's still something worth investigating here (skimming through the comments has not convinced me there is), please gather fresh data, ideally from a current upstream kernel if that shows the same behavior, and we'll start again from a fresh baseline.
Product Management has reviewed and declined this request. You may appeal this decision by reopening this request.