Description of problem:
The I/O latency of a Cinder volume after live migration of the instance to which it is attached increases significantly, and stays increased until the VM is stopped and started again (the VM is booted from the Cinder volume). This is not the case when using a disk from the Nova store backend, i.e. without a Cinder volume (or at least the difference isn't nearly as high after a live migration). The backend is Ceph 2.0.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a VM with a Cinder volume and live migrate it.
2. Check the I/O latency using ioping.

Actual results:

Expected results:

Additional info:
The I/O latency of a Cinder volume after live migration of the instance to which it's attached increases significantly, and stays increased until the VM is stopped and started again. Here are some captured data for reference.

ioping output on the Cinder volume attached to the instance, before live migration:

[cloud-user@test-virt-scsi ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=98.0 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=135.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=155.5 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=161.7 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=148.4 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=354.3 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=138.0 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=8 time=150.7 us
4 KiB <<< . (xfs /dev/sda1): request=9 time=149.6 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=138.6 us (fast)

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 1.53 ms, 36 KiB read, 5.87 k iops, 22.9 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 135.6 us / 170.3 us / 354.3 us / 65.6 us

After live migration:

[cloud-user@test-virt-scsi ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=1.03 ms (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=948.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=955.7 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=920.5 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=1.03 ms
4 KiB <<< . (xfs /dev/sda1): request=6 time=838.2 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=1.13 ms (slow)
4 KiB <<< . (xfs /dev/sda1): request=8 time=868.6 us
4 KiB <<< . (xfs /dev/sda1): request=9 time=985.2 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=936.6 us

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 8.61 ms, 36 KiB read, 1.04 k iops, 4.08 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 838.2 us / 956.9 us / 1.13 ms / 81.0 us

This goes back to an average of ~200 us again after shutting down and starting up the instance.

This is not the case when using a disk from the Nova store backend (or at least the difference isn't nearly as high after a live migration).

Before migration:

[cloud-user@test-no-vol ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=103.6 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=236.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=284.2 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=163.3 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=168.9 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=221.9 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=182.1 us
4 KiB <<< . (xfs /dev/sda1): request=8 time=154.0 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=9 time=179.8 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=174.4 us

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 1.77 ms, 36 KiB read, 5.10 k iops, 19.9 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 154.0 us / 196.1 us / 284.2 us / 40.3 us

After migration:

[cloud-user@test-no-vol ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=198.1 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=229.3 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=253.2 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=230.5 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=443.3 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=390.6 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=249.4 us
4 KiB <<< . (xfs /dev/sda1): request=8 time=231.3 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=9 time=247.9 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=238.2 us (fast)

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 2.51 ms, 36 KiB read, 3.58 k iops, 14.0 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 229.3 us / 279.3 us / 443.3 us / 75.1 us

-----------------

The storage backend for both Cinder and Nova is Ceph 2.x. The setup doesn't have any load yet, so this is not a bad-neighbour effect on the compute node after live migration. It is also not related to a specific compute node's network performance, since latency comes back to ~200 us after a VM shutdown and startup. After repeated tests this can be consistently captured: the Cinder volume I/O latency increases significantly after live migration and doesn't decrease until the instance is shut off and started again.
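For completeness, a minimal sketch of one way to drive the reproduction steps above from the OpenStack CLI. Image, network, and volume names here are placeholders for this environment, and booting via `--volume` is just one way to get a Cinder-backed root disk:

    # Create a bootable Cinder volume and boot an instance from it
    $ openstack volume create --image <glance-image> --size 40 test-vol
    $ openstack server create --flavor m1.medium --volume test-vol \
          --nic net-id=<network-id> cinder-test

    # Baseline latency from inside the guest
    [cloud-user@cinder-test ~]$ sudo ioping -c 10 .

    # Live migrate the instance (let the scheduler pick the destination host)
    $ nova live-migration cinder-test

    # Repeat the measurement inside the guest once migration completes
    [cloud-user@cinder-test ~]$ sudo ioping -c 10 .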
Reassigning this to Nova since it's unrelated to Cinder. With the RBD backend the Cinder service has no effect on the live migration, since both the attach/detach and initialize/terminate connection calls are basically NOOPs. I think this could be related to a time delta between the source and destination hosts, because a TSC difference can result in a live-migrated guest becoming sluggish or even hanging.
Trying to wrap my head around what precisely is causing the I/O latency after migration. Here's the report of my last couple of hours of investigating this issue with QEMU developers (thanks: Stefan Hajnoczi, Dave Gilbert, and Dan Berrange). I will split it into two comments; the next comment will contain potential tests we'd like to be done to better understand the problem.

- I see from the 'sosreports' that all the disks are RBD volumes:

      <disk type="network" device="disk">
        <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
        <source protocol="rbd" name="volumes/volume-5a815515-339d-4539-8ab9-415b662232f7">

  Please confirm if it's otherwise.

- Next, please confirm whether the source and destination hosts are in a synchronized state w.r.t. wall-clock time (a quick way to check this is sketched after this comment). The reason: quoting a post[1] written by an ex-Red Hatter who was a QEMU live migration developer:

      "The wall-clock times on both the servers must match. NTP helps with this, but configuring the right time zones and ensuring the times are the same on both the hosts are essential. We have had several bug reports where migration initially succeeds, but later the guest becomes sluggish or stops running entirely; and it was quite difficult to zero down the problem to the time delta on the hosts. Windows guests are especially vulnerable to this source of misconfiguration."

      [1] https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/

- I also checked on the Ceph upstream channel to rule out a certain Ceph issue:

      [Kashyap] Is there a performance hit to be expected when two clients access a file/volume at the same time? In this case the said 'client' is QEMU, during live migration. Maybe this forces Ceph to route the request to ensure data consistency or something like that.
      [Peetaur2] Live migration of RAM and VM, but same storage, yes?
      [Kashyap] Yes.
      [Peetaur2] In those cases I wouldn't expect any noticable effect, unless you count the 10ms or whatever when it finishes and pauses for a short time
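For the wall-clock question above, a minimal check that can be run against both compute nodes. The host names are placeholders, and this assumes chrony or ntpd is managing time (the RHEL 7 default):

    # Print both hosts' clocks side by side; the delta should be well under a second
    $ for host in <source-compute> <dest-compute>; do
          ssh $host 'echo -n "$(hostname): "; date --utc +%s.%N'
      done

    # Confirm NTP synchronization status and time zone on each host
    $ ssh <source-compute> timedatectl
    $ ssh <dest-compute> chronyc tracking    # or 'ntpstat' if ntpd is in use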
Tests to understand the behavior more closely
---------------------------------------------

[Noted in increasing complexity.]

(1) Examine the QEMU I/O request counters, to see how many requests have been processed. Run:

        $ virsh domblkstat nova-instance-1

    on the source (before migration) and destination (after migration) host, then post the output somewhere in a plain text file so we can examine the I/O data. (A sketch of how to turn two such snapshots into an average per-request latency follows this comment.)

    Stefan Hajnoczi notes: this approach is only reliable if 1) the guest is idle besides running the benchmark, and 2) you know what workload the benchmark is generating. If `ioping` only generates 1 write per second, then it could get lost in noise (e.g. the guest kernel doing read-ahead). But if you know to expect a significant I/O pattern, then it could be a useful way of checking for disk accesses.

(2) It is useful to repeat the `ioping` benchmark on a RAM disk and see if it hits the same problem, because this test takes any actual I/O out of the picture. If you still see the latency, this test tells us the issue has nothing to do with I/O, but lies somewhere else. (Thanks to Dave Gilbert for this suggestion.)

(3) Using `fio` with 'direct=1' (which does non-buffered I/O) is a good sanity check if you want a second opinion regarding latency. Here's a config and how to run it (after installing the `fio` RPM):

    Prepare this config file:

        $ cat rand-write.fio
        [writes]
        rw=randwrite
        direct=1
        ioengine=libaio
        bs=512
        filename=/export/fio-rand-write
        time_based
        runtime=30
        size=4K

    And run it as:

        $ fio rand-write.fio

(4) Tracing QEMU disk I/O with SystemTap. If the first test doesn't yield any useful detail, we may have to perform more instrumentation to trace QEMU I/O using SystemTap. Stefan Hajnoczi has suggested an approach: it is possible to trace QEMU disk I/O using the following 'read', 'write', and 'completion' probes:

        qemu.system.x86_64.virtio_blk_handle_read
        qemu.system.x86_64.virtio_blk_handle_write
        qemu.system.x86_64.virtio_blk_rw_complete

    These probes are defined in /usr/share/systemtap/tapset/qemu-system-x86_64.stp

    Example syntax (this is not to be taken literally; we need to use the above-mentioned 'virtio_blk' trace events):

        probe qemu2.system.x86_64.blk_co_preadv {
            printf("blk=%p,bs=%p,offset=%lu,bytes=%u,flags=%#x\n",
                   blk, bs, offset, bytes, flags);
        }

    Its invocation is:

        $ stap myscript.stp

    It should find the running QEMU process and print to stdout. NB: the tracepoints are in QEMU, so `stap` needs to be run on the host.
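For test (1), a rough sketch of turning two `virsh domblkstat` snapshots (taken around a benchmark run) into an average per-request write latency. The domain name "nova-instance-1" and device "vdb" are placeholders, wr_total_times is reported in nanoseconds, and the awk scans for field names rather than positions because the exact output layout can vary slightly between libvirt versions:

    # Snapshot the counters, run the benchmark in the guest, snapshot again
    $ virsh domblkstat nova-instance-1 vdb > /tmp/blkstat.before
    $ # ... run ioping/fio inside the guest ...
    $ virsh domblkstat nova-instance-1 vdb > /tmp/blkstat.after

    # Average write latency = delta(wr_total_times) / delta(wr_req)
    $ awk 'FNR == 1 { f++ }
           { for (i = 1; i < NF; i++) s[f, $i] = $(i + 1) }
           END {
               dreq = s[2, "wr_req"] - s[1, "wr_req"]
               dns  = s[2, "wr_total_times"] - s[1, "wr_total_times"]
               if (dreq > 0)
                   printf "avg write latency: %.1f us\n", dns / dreq / 1000
           }' /tmp/blkstat.before /tmp/blkstat.after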
Thanks for the update.

- Yes, all the disks are RBD.
- Time is already in sync, as verified in the output requested in comment #4.
- I will request the consultant to do the requested tests, but it may take time as the [vendor] onsite engagement consulting hours are over.
Please find below result available in case also ======================= Results inline: > (1) Examine the QEMU I/O request counters. One can see how many > requests have been processed. > > Run: > > $ virsh domblkstat nova-instance-1 > > On source (before migration) and destination (after migration) host > -- then post that somewhere in a plain text file, to examine the I/O > data. > > Stefan Hajnoczi notes: > > This approach is only reliable if 1) the guest is idle besides > running the benchmark 2) they know what workload the benchmark is > generating. > > If `ioping` only generates 1 write per second, then it could get > lost in noise (e.g. guest kernel doing read-ahead). But if you > know to expect a significant I/O pattern, then it could be a > useful way of checking for disk accesses. combining the results of the test results with fio tests run later. > (2) It is useful to repeat the `ioping` benchmark on a RAM disk and see > if it hits the same problem. Because, this test removes any actual > I/O out of the picture. If you still see the latency, this test > will tell us, the issue has nothing to do with I/O, but somewhere > else. (Thanks to Dave Gilbert for this suggestion.) I mounted the tmpfs to /srv to run the test. Don't see any significant difference. Before migration the average was 2-3 us. After migration it's also same. It's not an iops issue across -- since a Nova disk (also on ceph) doesn't show any significant latency; only constrained to a Cinder disk [doesn't matter whether it's a cinder-based root disk (BFV), or a Cinder volume attached anywhere in the filesystem]. Reference (also sharing comparisons with nova & cinder-managed disks (to draw attention to the difference between nova disks and cinder disks attached to the same server): tmpfs mount, and ioping test results run before migration. [root@cinder-test ~]# mount -t tmpfs -o size=500m tmpfs /srv/ [root@cinder-test ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 40G 965M 40G 3% / devtmpfs 1.9G 0 1.9G 0% /dev tmpfs 1.9G 0 1.9G 0% /dev/shm tmpfs 1.9G 17M 1.9G 1% /run tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup tmpfs 380M 0 380M 0% /run/user/1000 tmpfs 500M 0 500M 0% /srv [root@cinder-test ~]# cd /srv/ PRE-LIVE-MIGRATION RESULTS: ioping test on ramdisk: [root@cinder-test srv]# ioping -c 10 . 4 KiB <<< . (tmpfs tmpfs): request=1 time=2.96 us (warmup) 4 KiB <<< . (tmpfs tmpfs): request=2 time=3.85 us 4 KiB <<< . (tmpfs tmpfs): request=3 time=2.29 us 4 KiB <<< . (tmpfs tmpfs): request=4 time=2.54 us 4 KiB <<< . (tmpfs tmpfs): request=5 time=2.10 us 4 KiB <<< . (tmpfs tmpfs): request=6 time=2.57 us 4 KiB <<< . (tmpfs tmpfs): request=7 time=2.14 us (fast) 4 KiB <<< . (tmpfs tmpfs): request=8 time=2.35 us 4 KiB <<< . (tmpfs tmpfs): request=9 time=2.29 us 4 KiB <<< . (tmpfs tmpfs): request=10 time=2.20 us --- . (tmpfs tmpfs) ioping statistics --- 9 requests completed in 22.3 us, 36 KiB read, 402.9 k iops, 1.54 GiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 2.10 us / 2.48 us / 3.85 us / 507 ns ioping test on nova disk (nova backend is ceph): [root@cinder-test ~]# ioping -c 10 . 4 KiB <<< . (xfs /dev/vda1): request=1 time=80.9 us (warmup) 4 KiB <<< . (xfs /dev/vda1): request=2 time=131.4 us 4 KiB <<< . (xfs /dev/vda1): request=3 time=149.4 us 4 KiB <<< . (xfs /dev/vda1): request=4 time=127.1 us 4 KiB <<< . (xfs /dev/vda1): request=5 time=127.0 us 4 KiB <<< . (xfs /dev/vda1): request=6 time=125.3 us 4 KiB <<< . (xfs /dev/vda1): request=7 time=124.3 us (fast) 4 KiB <<< . 
(xfs /dev/vda1): request=8 time=148.5 us (slow) 4 KiB <<< . (xfs /dev/vda1): request=9 time=98.9 us (fast) 4 KiB <<< . (xfs /dev/vda1): request=10 time=101.8 us --- . (xfs /dev/vda1) ioping statistics --- 9 requests completed in 1.13 ms, 36 KiB read, 7.94 k iops, 31.0 MiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 98.9 us / 125.9 us / 149.4 us / 16.4 us ioping test on cinder-managed disk (cinder backend is ceph) [root@cinder-test mnt]# ioping -c 10 . 4 KiB <<< . (xfs /dev/vdb): request=1 time=73.6 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=141.9 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=147.6 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=233.6 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=123.3 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=190.4 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=159.7 us 4 KiB <<< . (xfs /dev/vdb): request=8 time=142.3 us 4 KiB <<< . (xfs /dev/vdb): request=9 time=155.8 us 4 KiB <<< . (xfs /dev/vdb): request=10 time=139.6 us --- . (xfs /dev/vdb) ioping statistics --- 9 requests completed in 1.43 ms, 36 KiB read, 6.28 k iops, 24.5 MiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 123.3 us / 159.4 us / 233.6 us / 31.5 us ================ POST-LIVE-MIGRATION RESULTS: Ramdisk: [root@cinder-test srv]# ioping -c 10 . 4 KiB <<< . (tmpfs tmpfs): request=1 time=2.13 us (warmup) 4 KiB <<< . (tmpfs tmpfs): request=2 time=2.21 us 4 KiB <<< . (tmpfs tmpfs): request=3 time=1.97 us 4 KiB <<< . (tmpfs tmpfs): request=4 time=1.88 us 4 KiB <<< . (tmpfs tmpfs): request=5 time=4.29 us 4 KiB <<< . (tmpfs tmpfs): request=6 time=1.94 us 4 KiB <<< . (tmpfs tmpfs): request=7 time=1.96 us (fast) 4 KiB <<< . (tmpfs tmpfs): request=8 time=1.99 us (fast) 4 KiB <<< . (tmpfs tmpfs): request=9 time=1.97 us (fast) 4 KiB <<< . (tmpfs tmpfs): request=10 time=4.57 us (slow) --- . (tmpfs tmpfs) ioping statistics --- 9 requests completed in 22.8 us, 36 KiB read, 395.3 k iops, 1.51 GiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 1.88 us / 2.53 us / 4.57 us / 1.02 us Nova disk: [root@cinder-test ~]# ioping -c 10 . 4 KiB <<< . (xfs /dev/vda1): request=1 time=148.4 us (warmup) 4 KiB <<< . (xfs /dev/vda1): request=2 time=200.4 us 4 KiB <<< . (xfs /dev/vda1): request=3 time=186.5 us 4 KiB <<< . (xfs /dev/vda1): request=4 time=186.9 us 4 KiB <<< . (xfs /dev/vda1): request=5 time=184.7 us 4 KiB <<< . (xfs /dev/vda1): request=6 time=191.6 us 4 KiB <<< . (xfs /dev/vda1): request=7 time=184.5 us (fast) 4 KiB <<< . (xfs /dev/vda1): request=8 time=193.1 us 4 KiB <<< . (xfs /dev/vda1): request=9 time=184.9 us (fast) 4 KiB <<< . (xfs /dev/vda1): request=10 time=188.7 us --- . (xfs /dev/vda1) ioping statistics --- 9 requests completed in 1.70 ms, 36 KiB read, 5.29 k iops, 20.7 MiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 184.5 us / 189.0 us / 200.4 us / 4.93 us Cinder disk: [root@cinder-test mnt]# ioping -c 10 . 4 KiB <<< . (xfs /dev/vdb): request=1 time=688.8 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=1.01 ms 4 KiB <<< . (xfs /dev/vdb): request=3 time=867.7 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=950.4 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=1.25 ms 4 KiB <<< . (xfs /dev/vdb): request=6 time=951.6 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=775.1 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=8 time=776.3 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=9 time=872.2 us 4 KiB <<< . 
(xfs /dev/vdb): request=10 time=734.2 us (fast) --- . (xfs /dev/vdb) ioping statistics --- 9 requests completed in 8.19 ms, 36 KiB read, 1.10 k iops, 4.29 MiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 734.2 us / 910.1 us / 1.25 ms / 148.8 us > > (3) Use `fio` with 'direct=1' (which uses non-bufferred I/O) is a good > sanity check if you want to get a second opinion regarding latency. > Here's a config and how to run it (after installing the `fio` RPM): > > Prepare this config file: > $ cat rand-write.fio > [writes] > rw=randwrite > direct=1 > ioengine=libaio > bs=512 > filename=/export/fio-rand-write > time_based > runtime=30 > size=4K > > And run it as: > $ fio rand-write.fio Disk mounts: [root@cinder-test ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 40G 1.3G 39G 4% / <-----------------------------Nova Disk devtmpfs 1.9G 0 1.9G 0% /dev tmpfs 1.9G 0 1.9G 0% /dev/shm tmpfs 1.9G 17M 1.9G 1% /run tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup /dev/vdb 40G 33M 40G 1% /mnt <---------------------------Cinder Disk tmpfs 380M 0 380M 0% /run/user/1000 tmpfs 380M 0 380M 0% /run/user/0 Fio conf file in /root/ (nova disk): [root@cinder-test ~]# cat rand-write.fio [writes] rw=randwrite direct=1 ioengine=libaio bs=512 filename=/root/fio-rand-write time_based runtime=30 size=4K Fio conf file in /mnt/ (cinder disk) [root@cinder-test ~]# cat /mnt/rand-write.fio [writes] rw=randwrite direct=1 ioengine=libaio bs=512 filename=/mnt/fio-rand-write time_based runtime=30 size=4K ------------------------ Results -- combining virsh domblkstat nova-instance details (instance on compute node 1 before migration). Pre-migration fio results -- cinder: [root@cinder-test mnt]# fio rand-write.fio writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1 fio-2.2.8 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/13222KB/0KB /s] [0/26.5K/0 iops] [eta 00m:00s] writes: (groupid=0, jobs=1): err= 0: pid=9357: Wed Jun 28 07:00:18 2017 write: io=145385KB, bw=4848.3KB/s, iops=9696, runt= 29987msec slat (usec): min=3, max=605, avg= 6.53, stdev= 3.77 clat (usec): min=20, max=41570, avg=94.66, stdev=341.63 lat (usec): min=28, max=41586, avg=101.40, stdev=344.16 clat percentiles (usec): | 1.00th=[ 27], 5.00th=[ 27], 10.00th=[ 28], 20.00th=[ 30], | 30.00th=[ 30], 40.00th=[ 30], 50.00th=[ 30], 60.00th=[ 31], | 70.00th=[ 32], 80.00th=[ 33], 90.00th=[ 34], 95.00th=[ 39], | 99.00th=[ 1848], 99.50th=[ 1976], 99.90th=[ 2192], 99.95th=[ 2256], | 99.99th=[ 2480] lat (usec) : 50=96.00%, 100=0.18%, 250=0.02%, 500=0.02%, 750=0.01% lat (usec) : 1000=0.01% lat (msec) : 2=3.36%, 4=0.41%, 10=0.01%, 50=0.01% cpu : usr=1.88%, sys=8.58%, ctx=290778, majf=0, minf=32 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=290770/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=145385KB, aggrb=4848KB/s, minb=4848KB/s, maxb=4848KB/s, mint=29987msec, maxt=29987msec Disk stats (read/write): vdb: ios=0/288894, merge=0/0, ticks=0/26948, in_queue=26916, util=89.70% ------------- Pre-migration -- while fio benchmark running on Cinder disk -- command randomly executed while fio was running: ------------------ [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 
2179025 wr_bytes 1142330368 flush_operations 58 rd_total_times 5608454723 wr_total_times 57818127850 flush_total_times 273423193 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2181619 wr_bytes 1143658496 flush_operations 58 rd_total_times 5608454723 wr_total_times 62239063054 flush_total_times 273423193 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2183176 wr_bytes 1144455680 flush_operations 58 rd_total_times 5608454723 wr_total_times 64742764011 flush_total_times 273423193 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2184700 wr_bytes 1145235968 flush_operations 58 rd_total_times 5608454723 wr_total_times 67016778336 flush_total_times 273423193 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2185951 wr_bytes 1145876480 flush_operations 58 rd_total_times 5608454723 wr_total_times 68892866636 flush_total_times 273423193 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2187604 wr_bytes 1146722816 flush_operations 58 rd_total_times 5608454723 wr_total_times 71698739790 flush_total_times 273423193 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2217610 wr_bytes 1162120192 flush_operations 62 rd_total_times 5608454723 wr_total_times 73398293893 flush_total_times 280723651 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2293585 wr_bytes 1201019392 flush_operations 62 rd_total_times 5608454723 wr_total_times 75041029644 flush_total_times 280723651 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9503 rd_bytes 270524928 wr_req 2390092 wr_bytes 1250430976 flush_operations 62 rd_total_times 5608454723 wr_total_times 77226013060 flush_total_times 280723651 =============== Pre-migration fio results -- nova: -------------- [root@cinder-test ~]# fio rand-write.fio writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1 fio-2.2.8 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/11914KB/0KB /s] [0/23.9K/0 iops] [eta 00m:00s] writes: (groupid=0, jobs=1): err= 0: pid=9312: Wed Jun 28 06:58:11 2017 write: io=367299KB, bw=12256KB/s, iops=24512, runt= 29968msec slat (usec): min=3, max=271, avg= 5.38, stdev= 1.88 clat (usec): min=13, max=2294, avg=33.81, stdev=10.00 lat (usec): min=28, max=2300, avg=39.37, stdev=10.22 clat percentiles (usec): | 1.00th=[ 29], 5.00th=[ 30], 10.00th=[ 30], 20.00th=[ 30], | 30.00th=[ 31], 40.00th=[ 32], 50.00th=[ 33], 60.00th=[ 33], | 70.00th=[ 34], 80.00th=[ 36], 90.00th=[ 39], 95.00th=[ 43], | 99.00th=[ 49], 99.50th=[ 54], 99.90th=[ 69], 99.95th=[ 94], | 99.99th=[ 278] lat (usec) : 20=0.01%, 50=99.08%, 100=0.87%, 250=0.02%, 500=0.02% lat (usec) : 750=0.01%, 1000=0.01% lat (msec) : 2=0.01%, 4=0.01% cpu : usr=3.25%, sys=18.25%, ctx=734602, majf=0, minf=31 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=734597/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=367298KB, aggrb=12256KB/s, minb=12256KB/s, maxb=12256KB/s, mint=29968msec, maxt=29968msec Disk stats (read/write): vda: 
ios=0/732575, merge=0/0, ticks=0/23982, in_queue=23901, util=79.66% --------- Data collected while running the fio tests: Pre migration -- while fio benchmark running on nova root disk -- command randomly executed while benchmark was running: -------------- [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 1618880 wr_bytes 855159808 flush_operations 40 rd_total_times 5608198928 wr_total_times 41750684997 flush_total_times 224341403 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 1701879 wr_bytes 897655296 flush_operations 40 rd_total_times 5608198928 wr_total_times 43544355787 flush_total_times 224341403 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 1798892 wr_bytes 947325952 flush_operations 40 rd_total_times 5608198928 wr_total_times 45711246318 flush_total_times 224341403 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 1888475 wr_bytes 993192448 flush_operations 40 rd_total_times 5608198928 wr_total_times 47692698583 flush_total_times 224341403 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 1975088 wr_bytes 1037538304 flush_operations 40 rd_total_times 5608198928 wr_total_times 49570697933 flush_total_times 224341403 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 2030490 wr_bytes 1065904128 flush_operations 40 rd_total_times 5608198928 wr_total_times 51176172805 flush_total_times 224341403 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 2092205 wr_bytes 1097502720 flush_operations 42 rd_total_times 5608198928 wr_total_times 52648933062 flush_total_times 229713279 [root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c rd_req 9502 rd_bytes 270520832 wr_req 2137502 wr_bytes 1120694784 flush_operations 42 rd_total_times 5608198928 wr_total_times 53629498113 flush_total_times 229713279 --------------------- POST MIGRATION: Instance has moved to compute node 2 (from compute node 1) _________________ Post-live-migration Cinder fio bench results (note the drastic drop in IOPS between pre- and post-migration by almost 20 times: [root@cinder-test mnt]# fio rand-write.fio writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1 fio-2.2.8 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/310KB/0KB /s] [0/620/0 iops] [eta 00m:00s] writes: (groupid=0, jobs=1): err= 0: pid=9378: Wed Jun 28 07:09:08 2017 write: io=8807.6KB, bw=300629B/s, iops=587, runt= 30000msec slat (usec): min=8, max=252, avg=17.54, stdev= 5.43 clat (usec): min=1062, max=41568, avg=1680.09, stdev=870.98 lat (usec): min=1079, max=41585, avg=1698.03, stdev=871.16 clat percentiles (usec): | 1.00th=[ 1256], 5.00th=[ 1368], 10.00th=[ 1432], 20.00th=[ 1496], | 30.00th=[ 1544], 40.00th=[ 1592], 50.00th=[ 1640], 60.00th=[ 1704], | 70.00th=[ 1752], 80.00th=[ 1816], 90.00th=[ 1912], 95.00th=[ 1992], | 99.00th=[ 2160], 99.50th=[ 2256], 99.90th=[ 3920], 99.95th=[10688], | 99.99th=[41728] lat (msec) : 2=95.12%, 4=4.79%, 10=0.04%, 20=0.01%, 50=0.05% cpu : usr=0.46%, sys=1.50%, ctx=17618, majf=0, minf=32 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : 
total=r=0/w=17615/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=8807KB, aggrb=293KB/s, minb=293KB/s, maxb=293KB/s, mint=30000msec, maxt=30000msec Disk stats (read/write): vdb: ios=0/17568, merge=0/0, ticks=0/29436, in_queue=29429, util=98.09% ------------------ Post-migration Cinder: -- command randomly executed during fio benchmark. ---------------------- [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 13008 wr_bytes 6668800 flush_operations 4 rd_total_times 0 wr_total_times 21240709235 flush_total_times 8675859 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 14308 wr_bytes 7334400 flush_operations 4 rd_total_times 0 wr_total_times 23404807773 flush_total_times 8675859 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 17621 wr_bytes 9032192 flush_operations 8 rd_total_times 0 wr_total_times 28647256606 flush_total_times 11206144 ==================== Post-migration fio bench results on the nova disk (IOPS results captured is consistent between pre- and post-migration): ------------------------- [root@cinder-test ~]# fio rand-write.fio writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1 fio-2.2.8 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/12347KB/0KB /s] [0/24.7K/0 iops] [eta 00m:00s] writes: (groupid=0, jobs=1): err= 0: pid=9383: Wed Jun 28 07:11:33 2017 write: io=417958KB, bw=13948KB/s, iops=27896, runt= 29965msec slat (usec): min=2, max=270, avg= 5.08, stdev= 2.03 clat (usec): min=10, max=8217, avg=29.08, stdev=19.96 lat (usec): min=25, max=8222, avg=34.34, stdev=20.08 clat percentiles (usec): | 1.00th=[ 27], 5.00th=[ 27], 10.00th=[ 27], 20.00th=[ 28], | 30.00th=[ 28], 40.00th=[ 28], 50.00th=[ 28], 60.00th=[ 28], | 70.00th=[ 29], 80.00th=[ 30], 90.00th=[ 31], 95.00th=[ 33], | 99.00th=[ 43], 99.50th=[ 45], 99.90th=[ 55], 99.95th=[ 73], | 99.99th=[ 282] lat (usec) : 20=0.01%, 50=99.81%, 100=0.16%, 250=0.01%, 500=0.02% lat (usec) : 
750=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% cpu : usr=4.41%, sys=19.89%, ctx=835921, majf=0, minf=31 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=835916/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=417958KB, aggrb=13948KB/s, minb=13948KB/s, maxb=13948KB/s, mint=29965msec, maxt=29965msec Disk stats (read/write): vda: ios=0/833772, merge=0/0, ticks=0/24135, in_queue=24065, util=80.20% ---------- Post-migration Nova -- command randomly executed during fio benchmark: ------------------- [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 368858 wr_bytes 188915712 flush_operations 18 rd_total_times 0 wr_total_times 36183512565 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 450204 wr_bytes 230564864 flush_operations 18 rd_total_times 0 wr_total_times 37912236976 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 490439 wr_bytes 251165184 flush_operations 18 rd_total_times 0 wr_total_times 38772367741 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 545902 wr_bytes 279562240 flush_operations 18 rd_total_times 0 wr_total_times 39971329161 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 606208 wr_bytes 310438912 flush_operations 18 rd_total_times 0 wr_total_times 41271424813 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 664185 wr_bytes 340123136 flush_operations 18 rd_total_times 0 wr_total_times 42512679727 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 753278 wr_bytes 385738752 flush_operations 18 rd_total_times 0 wr_total_times 44419059221 flush_total_times 23622536 [root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c rd_req 0 rd_bytes 0 wr_req 810608 wr_bytes 415091712 flush_operations 18 rd_total_times 0 wr_total_times 45650939594 flush_total_times 23622536 Best, Atanu
Comment #9 is referring to a VM instance-0000008c, which appears to have 2 disks. The sosreport we've got only has instance-00000056, instance-00000059, and instance-0000005c, all of which have only 1 disk, which appears to be Cinder-backed.

To understand why the 2 disks in instance-0000008c have differing performance behaviour, we need to understand the QEMU configuration of this instance with 2 disks, to see what's different about the configuration of the Cinder vs Nova disks in this setup. So can we get the /var/run/libvirt/qemu/instance-0000008c.xml file from the target host?
Please find below also fresh sosreport on collabshell from both compute node 1 where the VM was migrated from (node that I ran a few live migrations of instance (4c7a067d-6c06-49c7-a675-802730b00b90) between one and two to capture the results.... the final migration was from node 1 to node 2 =================== Copying the xml file below. Also will attach the sosreports from Compute Node 1 & 2. The instance ID is 4c7a067d-6c06-49c7-a675-802730b00b90; and it was migrated from compute node 1 to 2 (in fact, a few migrations between one and two -- but the final one was from 1 to 2) while capturing the data yesterday. Best, Atanu [root@myr-gs-sg-ocpn-02 ~]# virsh list Id Name State ---------------------------------------------------- 20 instance-0000008c running [root@myr-gs-sg-ocpn-02 ~]# cat /var/run/libvirt/qemu/instance-0000008c.xml <!-- WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE OVERWRITTEN AND LOST. Changes to this xml configuration should be made using: virsh edit instance-0000008c or other application using the libvirt API. --> <domstatus state='running' reason='booted' pid='424294'> <monitor path='/var/lib/libvirt/qemu/domain-20-instance-0000008c/monitor.sock' json='1' type='unix'/> <vcpus> <vcpu id='0' pid='424362'/> <vcpu id='1' pid='424363'/> </vcpus> <qemuCaps> <flag name='kvm'/> <flag name='mem-path'/> <flag name='drive-serial'/> <flag name='chardev'/> <flag name='monitor-json'/> <flag name='sdl'/> <flag name='netdev'/> <flag name='rtc'/> <flag name='vhost-net'/> <flag name='no-hpet'/> <flag name='no-kvm-pit'/> <flag name='pci-configfd'/> <flag name='nodefconfig'/> <flag name='boot-menu'/> <flag name='fsdev'/> <flag name='name-process'/> <flag name='smbios-type'/> <flag name='vga-qxl'/> <flag name='spice'/> <flag name='vga-none'/> <flag name='boot-index'/> <flag name='hda-duplex'/> <flag name='drive-aio'/> <flag name='pci-multibus'/> <flag name='pci-bootindex'/> <flag name='ccid-emulated'/> <flag name='ccid-passthru'/> <flag name='chardev-spicevmc'/> <flag name='virtio-tx-alg'/> <flag name='device-qxl-vga'/> <flag name='pci-multifunction'/> <flag name='virtio-blk-pci.ioeventfd'/> <flag name='sga'/> <flag name='virtio-blk-pci.event_idx'/> <flag name='virtio-net-pci.event_idx'/> <flag name='cache-directsync'/> <flag name='piix3-usb-uhci'/> <flag name='piix4-usb-uhci'/> <flag name='usb-ehci'/> <flag name='ich9-usb-ehci1'/> <flag name='vt82c686b-usb-uhci'/> <flag name='usb-redir'/> <flag name='usb-hub'/> <flag name='no-shutdown'/> <flag name='cache-unsafe'/> <flag name='ich9-ahci'/> <flag name='no-acpi'/> <flag name='fsdev-readonly'/> <flag name='virtio-blk-pci.scsi'/> <flag name='drive-copy-on-read'/> <flag name='fsdev-writeout'/> <flag name='drive-iotune'/> <flag name='system_wakeup'/> <flag name='scsi-disk.channel'/> <flag name='scsi-block'/> <flag name='transaction'/> <flag name='block-job-async'/> <flag name='scsi-cd'/> <flag name='ide-cd'/> <flag name='no-user-config'/> <flag name='hda-micro'/> <flag name='dump-guest-memory'/> <flag name='nec-usb-xhci'/> <flag name='balloon-event'/> <flag name='bridge'/> <flag name='virtio-scsi-pci'/> <flag name='blockio'/> <flag name='disable-s3'/> <flag name='disable-s4'/> <flag name='usb-redir.filter'/> <flag name='ide-drive.wwn'/> <flag name='scsi-disk.wwn'/> <flag name='seccomp-sandbox'/> <flag name='reboot-timeout'/> <flag name='dump-guest-core'/> <flag name='seamless-migration'/> <flag name='block-commit'/> <flag name='vnc'/> <flag name='drive-mirror'/> <flag name='usb-redir.bootindex'/> <flag 
name='usb-host.bootindex'/> <flag name='blockdev-snapshot-sync'/> <flag name='qxl'/> <flag name='VGA'/> <flag name='cirrus-vga'/> <flag name='device-video-primary'/> <flag name='add-fd'/> <flag name='nbd-server'/> <flag name='virtio-rng'/> <flag name='rng-random'/> <flag name='rng-egd'/> <flag name='dtb'/> <flag name='ipv6-migration'/> <flag name='machine-opt'/> <flag name='machine-usb-opt'/> <flag name='pci-bridge'/> <flag name='vfio-pci'/> <flag name='vfio-pci.bootindex'/> <flag name='scsi-generic'/> <flag name='scsi-generic.bootindex'/> <flag name='mem-merge'/> <flag name='vnc-websocket'/> <flag name='drive-discard'/> <flag name='mlock'/> <flag name='vnc-share-policy'/> <flag name='device-del-event'/> <flag name='dmi-to-pci-bridge'/> <flag name='i440fx-pci-hole64-size'/> <flag name='q35-pci-hole64-size'/> <flag name='usb-storage'/> <flag name='usb-storage.removable'/> <flag name='virtio-mmio'/> <flag name='ich9-intel-hda'/> <flag name='kvm-pit-lost-tick-policy'/> <flag name='boot-strict'/> <flag name='pvpanic'/> <flag name='spice-file-xfer-disable'/> <flag name='spiceport'/> <flag name='usb-kbd'/> <flag name='host-pci-multidomain'/> <flag name='msg-timestamp'/> <flag name='active-commit'/> <flag name='change-backing-file'/> <flag name='memory-backend-ram'/> <flag name='numa'/> <flag name='memory-backend-file'/> <flag name='rtc-reset-reinjection'/> <flag name='splash-timeout'/> <flag name='iothread'/> <flag name='migrate-rdma'/> <flag name='ivshmem'/> <flag name='drive-iotune-max'/> <flag name='VGA.vgamem_mb'/> <flag name='qxl.vgamem_mb'/> <flag name='qxl-vga.vgamem_mb'/> <flag name='pc-dimm'/> <flag name='machine-vmport-opt'/> <flag name='aes-key-wrap'/> <flag name='dea-key-wrap'/> <flag name='pci-serial'/> <flag name='vhost-user-multiqueue'/> <flag name='migration-event'/> <flag name='ioh3420'/> <flag name='x3130-upstream'/> <flag name='xio3130-downstream'/> <flag name='rtl8139'/> <flag name='e1000'/> <flag name='virtio-net'/> <flag name='gic-version'/> <flag name='incoming-defer'/> <flag name='virtio-gpu'/> <flag name='virtio-keyboard'/> <flag name='virtio-mouse'/> <flag name='virtio-tablet'/> <flag name='virtio-input-host'/> <flag name='chardev-file-append'/> <flag name='ich9-disable-s3'/> <flag name='ich9-disable-s4'/> <flag name='vserport-change-event'/> <flag name='virtio-balloon-pci.deflate-on-oom'/> <flag name='qxl.vram64_size_mb'/> <flag name='qxl-vga.vram64_size_mb'/> <flag name='chardev-logfile'/> <flag name='debug-threads'/> <flag name='secret'/> <flag name='pxb'/> <flag name='pxb-pcie'/> <flag name='device-tray-moved-event'/> <flag name='nec-usb-xhci-ports'/> <flag name='virtio-scsi-pci.iothread'/> <flag name='name-guest'/> <flag name='drive-detect-zeroes'/> <flag name='tls-creds-x509'/> <flag name='intel-iommu'/> <flag name='smm'/> <flag name='query-hotpluggable-cpus'/> <flag name='virtio-net.rx_queue_size'/> <flag name='ivshmem-plain'/> <flag name='ivshmem-doorbell'/> </qemuCaps> <devices> <device alias='virtio-disk1'/> <device alias='virtio-disk0'/> <device alias='video0'/> <device alias='serial1'/> <device alias='serial0'/> <device alias='balloon0'/> <device alias='net0'/> <device alias='input0'/> <device alias='usb'/> </devices> <libDir path='/var/lib/libvirt/qemu/domain-20-instance-0000008c'/> <channelTargetDir path='/var/lib/libvirt/qemu/channel/target/domain-20-instance-0000008c'/> <domain type='kvm' id='20'> <name>instance-0000008c</name> <uuid>4c7a067d-6c06-49c7-a675-802730b00b90</uuid> <metadata> <nova:instance 
xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0"> <nova:package version="14.0.3-9.el7ost"/> <nova:name>cinder-test</nova:name> <nova:creationTime>2017-06-28 14:05:28</nova:creationTime> <nova:flavor name="m1.medium"> <nova:memory>4096</nova:memory> <nova:disk>40</nova:disk> <nova:swap>0</nova:swap> <nova:ephemeral>0</nova:ephemeral> <nova:vcpus>2</nova:vcpus> </nova:flavor> <nova:owner> <nova:user uuid="aa14d599c8694e1b859c3ac0bdf3d499">admin</nova:user> <nova:project uuid="4726e3735d18445d9f0816fc7849e33e">admin</nova:project> </nova:owner> <nova:root type="image" uuid="de627974-d900-4b71-bdd6-5b1138012422"/> </nova:instance> </metadata> <memory unit='KiB'>4194304</memory> <currentMemory unit='KiB'>4194304</currentMemory> <vcpu placement='static'>2</vcpu> <cputune> <shares>2048</shares> </cputune> <resource> <partition>/machine</partition> </resource> <sysinfo type='smbios'> <system> <entry name='manufacturer'>Red Hat</entry> <entry name='product'>OpenStack Compute</entry> <entry name='version'>14.0.3-9.el7ost</entry> <entry name='serial'>f6154140-1f78-4bb9-8c52-f3c7ba5bedd3</entry> <entry name='uuid'>4c7a067d-6c06-49c7-a675-802730b00b90</entry> <entry name='family'>Virtual Machine</entry> </system> </sysinfo> <os> <type arch='x86_64' machine='pc-i440fx-rhel7.3.0'>hvm</type> <boot dev='hd'/> <smbios mode='sysinfo'/> </os> <features> <acpi/> <apic/> </features> <cpu mode='host-model'> <model fallback='allow'/> <topology sockets='2' cores='1' threads='1'/> </cpu> <clock offset='utc'> <timer name='pit' tickpolicy='delay'/> <timer name='rtc' tickpolicy='catchup'/> <timer name='hpet' present='no'/> </clock> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>destroy</on_crash> <devices> <emulator>/usr/libexec/qemu-kvm</emulator> <disk type='network' device='disk'> <driver name='qemu' type='raw' cache='writeback' discard='unmap'/> <auth username='openstack'> <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/> </auth> <source protocol='rbd' name='vms/4c7a067d-6c06-49c7-a675-802730b00b90_disk'> <host name='192.168.23.12' port='6789'/> <host name='192.168.23.13' port='6789'/> <host name='192.168.23.14' port='6789'/> </source> <backingStore/> <target dev='vda' bus='virtio'/> <alias name='virtio-disk0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </disk> <disk type='network' device='disk'> <driver name='qemu' type='raw' cache='writeback' discard='unmap'/> <auth username='openstack'> <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/> </auth> <source protocol='rbd' name='volumes/volume-d485683d-5581-4ce9-8e7c-6bde9d285d86'> <host name='192.168.23.12' port='6789'/> <host name='192.168.23.13' port='6789'/> <host name='192.168.23.14' port='6789'/> </source> <backingStore/> <target dev='vdb' bus='virtio'/> <serial>d485683d-5581-4ce9-8e7c-6bde9d285d86</serial> <alias name='virtio-disk1'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </disk> <controller type='usb' index='0'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <interface type='bridge'> <mac address='fa:16:3e:bb:d6:6b'/> <source bridge='qbr34fccbd9-7e'/> <target dev='tap34fccbd9-7e'/> <model type='virtio'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </interface> <serial type='file'> <source 
path='/var/lib/nova/instances/4c7a067d-6c06-49c7-a675-802730b00b90/console.log'/> <target port='0'/> <alias name='serial0'/> </serial> <serial type='pty'> <source path='/dev/pts/0'/> <target port='1'/> <alias name='serial1'/> </serial> <console type='file'> <source path='/var/lib/nova/instances/4c7a067d-6c06-49c7-a675-802730b00b90/console.log'/> <target type='serial' port='0'/> <alias name='serial0'/> </console> <input type='tablet' bus='usb'> <alias name='input0'/> <address type='usb' bus='0' port='1'/> </input> <input type='mouse' bus='ps2'> <alias name='input1'/> </input> <input type='keyboard' bus='ps2'> <alias name='input2'/> </input> <graphics type='vnc' port='5900' autoport='yes' listen='0.0.0.0' keymap='en-us'> <listen type='address' address='0.0.0.0' fromConfig='0'/> </graphics> <video> <model type='cirrus' vram='16384' heads='1' primary='yes'/> <alias name='video0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </video> <memballoon model='virtio'> <stats period='10'/> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/> </memballoon> </devices> <seclabel type='dynamic' model='selinux' relabel='yes'> <label>system_u:system_r:svirt_t:s0:c373,c579</label> <imagelabel>system_u:object_r:svirt_image_t:s0:c373,c579</imagelabel> </seclabel> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+107:+107</label> <imagelabel>+107:+107</imagelabel> </seclabel> </domain> </domstatus> [root@myr-gs-sg-ocpn-02 ~]#
(In reply to Anil Dhingra from comment #11) > Please find below also fresh sosreport on collabshell from both compute node > 1 where the VM was migrated from (node that I ran a few live migrations of > instance (4c7a067d-6c06-49c7-a675-802730b00b90) between one and two to > capture the results.... the final migration was from node 1 to node 2 > =================== > <disk type='network' device='disk'> > <driver name='qemu' type='raw' cache='writeback' discard='unmap'/> > <auth username='openstack'> > <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/> > </auth> > <source protocol='rbd' > name='vms/4c7a067d-6c06-49c7-a675-802730b00b90_disk'> > <host name='192.168.23.12' port='6789'/> > <host name='192.168.23.13' port='6789'/> > <host name='192.168.23.14' port='6789'/> > </source> > <backingStore/> > <target dev='vda' bus='virtio'/> > <alias name='virtio-disk0'/> > <address type='pci' domain='0x0000' bus='0x00' slot='0x04' > function='0x0'/> > </disk> > <disk type='network' device='disk'> > <driver name='qemu' type='raw' cache='writeback' discard='unmap'/> > <auth username='openstack'> > <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/> > </auth> > <source protocol='rbd' > name='volumes/volume-d485683d-5581-4ce9-8e7c-6bde9d285d86'> > <host name='192.168.23.12' port='6789'/> > <host name='192.168.23.13' port='6789'/> > <host name='192.168.23.14' port='6789'/> > </source> > <backingStore/> > <target dev='vdb' bus='virtio'/> > <serial>d485683d-5581-4ce9-8e7c-6bde9d285d86</serial> > <alias name='virtio-disk1'/> > <address type='pci' domain='0x0000' bus='0x00' slot='0x05' > function='0x0'/> > </disk> Ok, both the Nova and Cinder disks here are pointing to the exact same RBD server, and have identical settings. The only potentially significant difference is that one disk is in the "vms" pool and one in the "volumes" pool. So I/O is potentially serviced by different underlying storage in the RBD server. It is hard to see why use of a different storage pool would have an effect on performance only after migration. It is also hard to see why a QEMU bug would only affect 1 disk after migration though, given the identical configs.
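Since the only visible difference is the "vms" vs "volumes" pool, one way to rule the pool itself in or out is to benchmark both pools directly from the compute hosts with `rados bench`, bypassing QEMU entirely. A rough sketch under the assumption that suitable cephx credentials are available on the compute nodes (pass --id/--keyring as appropriate for this cluster); -b 4096 and -t 1 roughly mirror the 4 KiB, queue-depth-1 guest workload:

    # Run on both the source and the destination compute node and compare
    $ rados -p vms bench 30 write -b 4096 -t 1
    $ rados -p volumes bench 30 write -b 4096 -t 1

If both pools show similar latency from the host while the guest still shows the post-migration degradation on the Cinder disk only, the pool can be ruled out.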
another test ========= instead of rebooting, if I umount the cinder volume disk, detach the disk from the instance, and reattach it (all this on live migrated server where the performance has already degraded in cinder only, w/o rebooting the server), it fixes the latency after reattachment. That's why I feel it's related to Cinder only, and something after live migration triggers this oddity. References below: 1. ioping results after live migration (fio already shared in my last comment): [root@cinder-test ~]# cd /mnt/ [root@cinder-test mnt]# ioping -c 10 . 4 KiB <<< . (xfs /dev/vdb): request=1 time=833.4 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=1.30 ms 4 KiB <<< . (xfs /dev/vdb): request=3 time=1.17 ms 4 KiB <<< . (xfs /dev/vdb): request=4 time=997.2 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=965.0 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=966.3 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=997.2 us ^C --- . (xfs /dev/vdb) ioping statistics --- 6 requests completed in 6.40 ms, 24 KiB read, 937 iops, 3.66 MiB/s generated 7 requests in 6.40 s, 28 KiB, 1 iops, 4.38 KiB/s min/avg/max/mdev = 965.0 us / 1.07 ms / 1.30 ms / 125.9 us 2. umount the disk [root@cinder-test mnt]# cd [root@cinder-test ~]# umount /mnt/ [root@cinder-test ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 40G 1.3G 39G 4% / devtmpfs 1.9G 0 1.9G 0% /dev tmpfs 1.9G 0 1.9G 0% /dev/shm tmpfs 1.9G 17M 1.9G 1% /run tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup tmpfs 380M 0 380M 0% /run/user/1000 3. Detach the vol from Horizon and confirm on VM: [root@cinder-test ~]# cat /proc/partitions major minor #blocks name 253 0 41943040 vda 253 1 41935052 vda1 4. Reattach the vol from Horizon and re-confirm availability: [root@cinder-test ~]# cat /proc/partitions major minor #blocks name 253 0 41943040 vda 253 1 41935052 vda1 253 16 41943040 vdb 5. Mount the disk at /mnt/ again: [root@cinder-test ~]# mount -a [root@cinder-test ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 40G 1.3G 39G 4% / devtmpfs 1.9G 0 1.9G 0% /dev tmpfs 1.9G 0 1.9G 0% /dev/shm tmpfs 1.9G 17M 1.9G 1% /run tmpfs 1.9G 0 1.9G 0% /sys/fs/cgroup tmpfs 380M 0 380M 0% /run/user/1000 /dev/vdb 40G 33M 40G 1% /mnt <---------------------- cinder disk 6. Run ioping test: [root@cinder-test ~]# cd /mnt/ [root@cinder-test mnt]# ioping -c 10 . 4 KiB <<< . (xfs /dev/vdb): request=1 time=124.3 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=311.7 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=211.6 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=251.5 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=180.3 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=179.7 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=206.1 us 4 KiB <<< . (xfs /dev/vdb): request=8 time=266.7 us 4 KiB <<< . (xfs /dev/vdb): request=9 time=301.3 us 4 KiB <<< . (xfs /dev/vdb): request=10 time=182.5 us (fast) --- . (xfs /dev/vdb) ioping statistics --- 9 requests completed in 2.09 ms, 36 KiB read, 4.30 k iops, 16.8 MiB/s generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s min/avg/max/mdev = 179.7 us / 232.4 us / 311.7 us / 49.1 us 7. 
Run fio test: [root@cinder-test mnt]# fio rand-write.fio writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1 fio-2.2.8 Starting 1 process Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/12928KB/0KB /s] [0/25.9K/0 iops] [eta 00m:00s] writes: (groupid=0, jobs=1): err= 0: pid=18981: Wed Jun 28 09:49:44 2017 write: io=389380KB, bw=12994KB/s, iops=25987, runt= 29967msec slat (usec): min=4, max=319, avg= 5.98, stdev= 1.89 clat (usec): min=10, max=7595, avg=30.79, stdev=15.24 lat (usec): min=33, max=7600, avg=36.98, stdev=15.37 clat percentiles (usec): | 1.00th=[ 29], 5.00th=[ 29], 10.00th=[ 30], 20.00th=[ 30], | 30.00th=[ 30], 40.00th=[ 30], 50.00th=[ 30], 60.00th=[ 31], | 70.00th=[ 31], 80.00th=[ 31], 90.00th=[ 31], 95.00th=[ 33], | 99.00th=[ 38], 99.50th=[ 41], 99.90th=[ 58], 99.95th=[ 67], | 99.99th=[ 262] lat (usec) : 20=0.01%, 50=99.81%, 100=0.17%, 250=0.01%, 500=0.01% lat (usec) : 750=0.01% lat (msec) : 2=0.01%, 4=0.01%, 10=0.01% cpu : usr=3.53%, sys=21.17%, ctx=778768, majf=0, minf=31 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued : total=r=0/w=778760/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=1 Run status group 0 (all jobs): WRITE: io=389380KB, aggrb=12993KB/s, minb=12993KB/s, maxb=12993KB/s, mint=29967msec, maxt=29967msec Disk stats (read/write): vdb: ios=0/776431, merge=0/3, ticks=0/22695, in_queue=22601, util=75.33%
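For reference, the detach/reattach workaround described above can also be driven from the CLI instead of Horizon; a sketch with a placeholder volume ID (the unmount/mount and benchmark steps inside the guest are the same as above):

    # On a node with admin credentials
    $ openstack server remove volume cinder-test <volume-id>
    $ openstack server add volume cinder-test <volume-id>

    # Inside the guest, remount and re-run the benchmark
    [root@cinder-test ~]# mount /dev/vdb /mnt
    [root@cinder-test ~]# cd /mnt && ioping -c 10 .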
In comment #3 I mentioned: "With the RBD backend the Cinder service has no effect on the live migration since both the attach/detach and initialize/terminate connection are basically NOOPs."

To make it clear for everyone, here is the actual code. Attach and detach are not defined in the RBD driver, so the code used is from the base class:

    def attach_volume(self, context, volume, instance_uuid, host_name,
                      mountpoint):
        """Callback for volume attached to instance or host."""
        pass

    def detach_volume(self, context, volume, attachment=None):
        """Callback for volume detached."""
        pass

Then the code to create and remove the exports does nothing as well:

    def create_export(self, context, volume, connector):
        """Exports the volume."""
        pass

    def remove_export(self, context, volume):
        """Removes an export for a logical volume."""
        pass

And finally the initialize and terminate connection code, which basically dumps the monitors' state and passes it along with the configuration:

    def initialize_connection(self, volume, connector):
        hosts, ports = self._get_mon_addrs()
        data = {
            'driver_volume_type': 'rbd',
            'data': {
                'name': '%s/%s' % (self.configuration.rbd_pool, volume.name),
                'hosts': hosts,
                'ports': ports,
                'cluster_name': self.configuration.rbd_cluster_name,
                'auth_enabled': (self.configuration.rbd_user is not None),
                'auth_username': self.configuration.rbd_user,
                'secret_type': 'ceph',
                'secret_uuid': self.configuration.rbd_secret_uuid,
                'volume_id': volume.id,
                "discard": True,
            }
        }
        LOG.debug('connection data: %s', data)
        return data

    def terminate_connection(self, volume, connector, **kwargs):
        pass

As you can see, there's not much that Cinder is doing there. Sure, Cinder could be reporting different monitors, or in a different order, but I checked the "hosts" and "ports" from the connection information that Cinder is returning in the logs I had here from the other day, and they are all the same in all entries:

    grep -RnI '"driver_volume_type": "rbd"' */var/log/nova/*.log

Which gives us:

    hosts: ["192.168.23.12", "192.168.23.13", "192.168.23.14"]
    ports: ["6789", "6789", "6789"]

So I reiterate my original assessment: this is not a Cinder problem.
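If anyone wants to double-check that list against the cluster itself, the monitor map can be dumped directly on a Ceph node; this is just a sanity check and assumes admin access to the cluster:

    # The addresses here should match the hosts/ports Cinder returns above
    $ ceph mon stat
    $ ceph mon dump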
Thanks, Gorka, for the Cinder-level analysis.

Meanwhile, Stefan Hajnoczi (a QEMU I/O expert) looked at the `fio` results, and he writes on IRC:

    "The `fio` results [from https://bugzilla.redhat.com/show_bug.cgi?id=1463897#c9] show that the Cinder disk is affected much more by tail latencies (99th percentile, etc). When you compare Nova vs. Cinder before migration they have similar 50th percentile latency. Post-migration the Cinder disk is suddenly experiencing ~98th percentile latency *all the time*. So it becomes super-slow."

So, the next suggested step is tracing 'librbd' write and request completion; one option is `perf` probes:

    perf probe -x /usr/lib64/librbd.so rbd_aio_write
    perf probe -x /usr/lib64/librbd.so rbd_aio_read

This requires:

(1) The kernel be compiled with:

    CONFIG_UPROBE=y
    CONFIG_UPROBE_EVENTS=y

(2) The debug RPMs for librbd.so be installed.

Jason Dillaman, from the Ceph project, also suggested (over e-mail) performing some performance testing with the 'librbd' cache disabled, perhaps with something like the following in ceph.conf:

    [client]
    rbd cache = false

Another approach Jason suggested, as opposed to `perf`, is the following: you can just use LTTng-UST directly with a standard kernel by adding "rbd tracing = true" to your ceph.conf [1] and use "rbd-replay-prep --verbose" to dump all the traced events.

[1] http://docs.ceph.com/docs/master/rbd/rbd-replay/
(In reply to Kashyap Chamarthy from comment #15)
> Thanks, Gorka, for the Cinder-level analysis.
>
> Meanwhile, Stefan Hajnoczi (a QEMU I/O expert) looked at the `fio`
> results, and writes on IRC:
>
>     "The `fio` results [from
>     https://bugzilla.redhat.com/show_bug.cgi?id=1463897#c9] show that
>     the Cinder disk is affected much more by tail latencies (99th
>     percentile, etc). When you compare Nova vs. Cinder before migration
>     they have similar 50th percentile latency. Post-migration the
>     Cinder disk is suddenly experiencing ~98th percentile latency *all
>     the time*. So it becomes super-slow."
>
> So the next suggested step is tracing librbd write and request
> completion, e.g. with `perf` probes:
>
>     perf probe -x /usr/lib64/librbd.so rbd_aio_write
>     perf probe -x /usr/lib64/librbd.so rbd_aio_read
>
> This requires:
>
> (1) The kernel be compiled with:
>
>     CONFIG_UPROBE=y
>     CONFIG_UPROBE_EVENTS=y

The RHEL 7.3 kernel has these config options enabled. It is not necessary to compile a custom kernel.
Thanks, Stefan. Indeed, I only meant to note, for completeness' sake, that the kernel needs those probes configured.

Test
----

(1) Add the probes:

    $ perf probe -x /usr/lib64/librbd.so.1 rbd_aio_read
    $ perf probe -x /usr/lib64/librbd.so.1 rbd_aio_write
    $ perf probe -x /usr/lib64/librbd.so.1 rbd_aio_release

(2) Record everything (read, write, and completion) with a single command line:

    $ perf record -a \
        -e probe_librbd:rbd_aio_read \
        -e probe_librbd:rbd_aio_write \
        -e probe_librbd:rbd_aio_release

(3) Then run a single O_DIRECT sequential I/O benchmark (e.g. reads) inside the guest:

    $ dd if=/dev/vda of=/dev/null iflag=direct count=50000

(4) Work out a script to calculate the request latencies from the perf-script(1) output.

Perform the above test for all four combinations: (a) with and without RBD cache enabled; (b) pre- and post-migration. So there are four tests to be done:

- Test-1: [Pre-migration] On the source Compute node, do the `perf-probe` test with RBD cache enabled.

  Example result for this:
  http://file.emea.redhat.com/~kchamart/Nova-bug-1463897/output-all-three-probes-sequential-reads-pre-migration-RBD-cache-enabled.txt

- Test-2: [Pre-migration] On the source Compute node, do the `perf-probe` test with RBD cache disabled.

- Test-3: [Post-migration] On the destination Compute node, do the `perf-probe` test with RBD cache enabled.

- Test-4: [Post-migration] On the destination Compute node, do the `perf-probe` test with RBD cache disabled.
tried to replicate this is in physical lab 1Ctrl + 2 compute + 3 ceph node with 1 Os disk + 1 osd disk & results were same after shutdown / power on VM result was better but without shutdown its was almost 75% degraded . Let me try do perf test also in LAB If you need setup details please do let me know - i can keep this setup for 1-2 days . Added new device & ioping test result approx 22 MB ~~~~~~~~~~~~~~ [root@vm1 ~]# mkdir /test [root@vm1 ~]# mount /dev/vdb /test [root@vm1 ~]# [root@vm1 ~]# cd /test [root@vm1 test]# ioping . 4 KiB <<< . (xfs /dev/vdb): request=1 time=124.0 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=235.9 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=145.1 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=120.3 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=145.8 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=177.8 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=137.9 us 4 KiB <<< . (xfs /dev/vdb): request=8 time=147.0 us 4 KiB <<< . (xfs /dev/vdb): request=9 time=133.5 us 4 KiB <<< . (xfs /dev/vdb): request=10 time=237.6 us (slow) 4 KiB <<< . (xfs /dev/vdb): request=11 time=216.2 us 4 KiB <<< . (xfs /dev/vdb): request=12 time=238.9 us (slow) 4 KiB <<< . (xfs /dev/vdb): request=13 time=105.6 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=14 time=143.7 us 4 KiB <<< . (xfs /dev/vdb): request=15 time=231.5 us ^C --- . (xfs /dev/vdb) ioping statistics --- 14 requests completed in 2.42 ms, 56 KiB read, 5.79 k iops, 22.6 MiB/s generated 15 requests in 14.5 s, 60 KiB, 1 iops, 4.14 KiB/s min/avg/max/mdev = 105.6 us / 172.6 us / 238.9 us / 47.0 us ~~~~~~~~ below during live migration ~~~~~~ [root@vm1 test]# ioping . 4 KiB <<< . (xfs /dev/vdb): request=1 time=82.8 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=234.6 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=439.3 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=265.8 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=259.4 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=232.5 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=207.5 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=8 time=260.3 us 4 KiB <<< . (xfs /dev/vdb): request=9 time=270.5 us 4 KiB <<< . (xfs /dev/vdb): request=10 time=1.28 ms (slow) 4 KiB <<< . (xfs /dev/vdb): request=11 time=970.4 us 4 KiB <<< . (xfs /dev/vdb): request=12 time=859.2 us 4 KiB <<< . (xfs /dev/vdb): request=13 time=899.6 us 4 KiB <<< . (xfs /dev/vdb): request=14 time=880.2 us 4 KiB <<< . (xfs /dev/vdb): request=15 time=918.4 us 4 KiB <<< . (xfs /dev/vdb): request=16 time=812.6 us 4 KiB <<< . (xfs /dev/vdb): request=17 time=909.3 us 4 KiB <<< . (xfs /dev/vdb): request=18 time=788.1 us 4 KiB <<< . (xfs /dev/vdb): request=19 time=870.0 us 4 KiB <<< . (xfs /dev/vdb): request=20 time=801.0 us 4 KiB <<< . (xfs /dev/vdb): request=21 time=953.8 us 4 KiB <<< . (xfs /dev/vdb): request=22 time=826.5 us 4 KiB <<< . (xfs /dev/vdb): request=23 time=1.05 ms 4 KiB <<< . (xfs /dev/vdb): request=24 time=855.8 us 4 KiB <<< . (xfs /dev/vdb): request=25 time=870.9 us 4 KiB <<< . (xfs /dev/vdb): request=26 time=778.8 us 4 KiB <<< . (xfs /dev/vdb): request=27 time=1.07 ms 4 KiB <<< . (xfs /dev/vdb): request=28 time=831.7 us 4 KiB <<< . (xfs /dev/vdb): request=29 time=877.2 us 4 KiB <<< . (xfs /dev/vdb): request=30 time=939.2 us 4 KiB <<< . (xfs /dev/vdb): request=31 time=1.09 ms 4 KiB <<< . (xfs /dev/vdb): request=32 time=949.5 us 4 KiB <<< . (xfs /dev/vdb): request=33 time=992.6 us 4 KiB <<< . (xfs /dev/vdb): request=34 time=948.4 us 4 KiB <<< . (xfs /dev/vdb): request=35 time=956.6 us 4 KiB <<< . 
(xfs /dev/vdb): request=36 time=1.10 ms 4 KiB <<< . (xfs /dev/vdb): request=37 time=914.6 us 4 KiB <<< . (xfs /dev/vdb): request=38 time=925.5 us 4 KiB <<< . (xfs /dev/vdb): request=39 time=938.7 us 4 KiB <<< . (xfs /dev/vdb): request=40 time=961.5 us 4 KiB <<< . (xfs /dev/vdb): request=41 time=910.6 us 4 KiB <<< . (xfs /dev/vdb): request=42 time=918.7 us 4 KiB <<< . (xfs /dev/vdb): request=43 time=1.11 ms 4 KiB <<< . (xfs /dev/vdb): request=44 time=1.02 ms 4 KiB <<< . (xfs /dev/vdb): request=45 time=1.01 ms 4 KiB <<< . (xfs /dev/vdb): request=46 time=974.3 us 4 KiB <<< . (xfs /dev/vdb): request=47 time=1.04 ms 4 KiB <<< . (xfs /dev/vdb): request=48 time=1.11 ms 4 KiB <<< . (xfs /dev/vdb): request=49 time=940.2 us 4 KiB <<< . (xfs /dev/vdb): request=50 time=948.7 us 4 KiB <<< . (xfs /dev/vdb): request=51 time=1.04 ms 4 KiB <<< . (xfs /dev/vdb): request=52 time=974.6 us 4 KiB <<< . (xfs /dev/vdb): request=53 time=932.9 us 4 KiB <<< . (xfs /dev/vdb): request=54 time=825.3 us 4 KiB <<< . (xfs /dev/vdb): request=55 time=934.0 us 4 KiB <<< . (xfs /dev/vdb): request=56 time=1.09 ms 4 KiB <<< . (xfs /dev/vdb): request=57 time=1.03 ms 4 KiB <<< . (xfs /dev/vdb): request=58 time=935.7 us 4 KiB <<< . (xfs /dev/vdb): request=59 time=1.13 ms 4 KiB <<< . (xfs /dev/vdb): request=60 time=825.3 us 4 KiB <<< . (xfs /dev/vdb): request=61 time=965.7 us 4 KiB <<< . (xfs /dev/vdb): request=62 time=811.6 us 4 KiB <<< . (xfs /dev/vdb): request=63 time=891.3 us 4 KiB <<< . (xfs /dev/vdb): request=64 time=887.2 us 4 KiB <<< . (xfs /dev/vdb): request=65 time=833.5 us 4 KiB <<< . (xfs /dev/vdb): request=66 time=798.7 us 4 KiB <<< . (xfs /dev/vdb): request=67 time=1.10 ms 4 KiB <<< . (xfs /dev/vdb): request=68 time=932.7 us 4 KiB <<< . (xfs /dev/vdb): request=69 time=871.5 us 4 KiB <<< . (xfs /dev/vdb): request=70 time=704.2 us 4 KiB <<< . (xfs /dev/vdb): request=71 time=885.0 us 4 KiB <<< . (xfs /dev/vdb): request=72 time=867.3 us 4 KiB <<< . (xfs /dev/vdb): request=73 time=1.03 ms 4 KiB <<< . (xfs /dev/vdb): request=74 time=889 us 4 KiB <<< . (xfs /dev/vdb): request=75 time=993.1 us 4 KiB <<< . (xfs /dev/vdb): request=76 time=906.6 us 4 KiB <<< . (xfs /dev/vdb): request=77 time=1.04 ms 4 KiB <<< . (xfs /dev/vdb): request=78 time=765.8 us 4 KiB <<< . (xfs /dev/vdb): request=79 time=836.8 us 4 KiB <<< . (xfs /dev/vdb): request=80 time=783.8 us 4 KiB <<< . (xfs /dev/vdb): request=81 time=807.3 us 4 KiB <<< . (xfs /dev/vdb): request=82 time=744.9 us 4 KiB <<< . (xfs /dev/vdb): request=83 time=835.4 us 4 KiB <<< . (xfs /dev/vdb): request=84 time=722.7 us 4 KiB <<< . (xfs /dev/vdb): request=85 time=762.4 us 4 KiB <<< . (xfs /dev/vdb): request=86 time=774.9 us 4 KiB <<< . (xfs /dev/vdb): request=87 time=913.1 us 4 KiB <<< . (xfs /dev/vdb): request=88 time=881.3 us 4 KiB <<< . (xfs /dev/vdb): request=89 time=883.9 us 4 KiB <<< . (xfs /dev/vdb): request=90 time=780.9 us 4 KiB <<< . (xfs /dev/vdb): request=91 time=830.9 us 4 KiB <<< . (xfs /dev/vdb): request=92 time=897.6 us 4 KiB <<< . (xfs /dev/vdb): request=93 time=874.8 us 4 KiB <<< . (xfs /dev/vdb): request=94 time=814.1 us ^C --- . 
(xfs /dev/vdb) ioping statistics --- 93 requests completed in 80.0 ms, 372 KiB read, 1.16 k iops, 4.54 MiB/s generated 94 requests in 1.56 min, 376 KiB, 1 iops, 4.02 KiB/s min/avg/max/mdev = 207.5 us / 860.5 us / 1.28 ms / 208.3 us [root@vm1 test]# ~~~~~~~~~~ [root@overcloud-controller-0 ~]# nova show VM1 | grep -i hypervi | OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-0.localdomain | [root@overcloud-controller-0 ~]# [root@overcloud-controller-0 ~]# nova show VM1 | grep -i hypervi | OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-1.localdomain | [root@overcloud-controller-0 ~]# ~~~~~~~~~ ~~~ tested again same result still near to 4 MB~~~ [root@vm1 test]# ioping . 4 KiB <<< . (xfs /dev/vdb): request=1 time=916.6 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=999.0 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=957.0 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=915.3 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=893.7 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=912.1 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=1.03 ms (slow) 4 KiB <<< . (xfs /dev/vdb): request=8 time=990.6 us 4 KiB <<< . (xfs /dev/vdb): request=9 time=1.03 ms (slow) 4 KiB <<< . (xfs /dev/vdb): request=10 time=887.2 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=11 time=949.0 us 4 KiB <<< . (xfs /dev/vdb): request=12 time=922.3 us 4 KiB <<< . (xfs /dev/vdb): request=13 time=1.12 ms (slow) 4 KiB <<< . (xfs /dev/vdb): request=14 time=917.6 us 4 KiB <<< . (xfs /dev/vdb): request=15 time=1.04 ms 4 KiB <<< . (xfs /dev/vdb): request=16 time=900.0 us 4 KiB <<< . (xfs /dev/vdb): request=17 time=987.8 us 4 KiB <<< . (xfs /dev/vdb): request=18 time=908.0 us 4 KiB <<< . (xfs /dev/vdb): request=19 time=1.03 ms 4 KiB <<< . (xfs /dev/vdb): request=20 time=921.8 us ^C --- . (xfs /dev/vdb) ioping statistics --- 19 requests completed in 18.3 ms, 76 KiB read, 1.04 k iops, 4.05 MiB/s generated 20 requests in 20.0 s, 80 KiB, 1 iops, 4.00 KiB/s min/avg/max/mdev = 887.2 us / 964.2 us / 1.12 ms / 63.3 us [root@vm1 test]# ~~~ Even after reboot no improvement~~~ [root@vm1 test]# reboot Connection to 10.0.0.9 closed by remote host. Connection to 10.0.0.9 closed. [root@overcloud-controller-0 ~]# ip netns exec qdhcp-68a81ea2-e5aa-43ee-be7d-a0eabd60672b ssh -i mykey.pem cloud-user.0.9 Last login: Wed Jul 19 02:54:39 2017 from host-10-0-0-2.openstacklocal [cloud-user@vm1 ~]$ sudo su - Last login: Wed Jul 19 02:54:43 EDT 2017 on pts/0 [root@vm1 ~]# mount /dev/vdb /test [root@vm1 ~]# cd /test [root@vm1 test]# ioping . 4 KiB <<< . (xfs /dev/vdb): request=1 time=1.04 ms (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=942.4 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=1.17 ms 4 KiB <<< . (xfs /dev/vdb): request=4 time=1.09 ms 4 KiB <<< . (xfs /dev/vdb): request=5 time=1.07 ms 4 KiB <<< . (xfs /dev/vdb): request=6 time=932.5 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=990.2 us 4 KiB <<< . (xfs /dev/vdb): request=8 time=1.00 ms 4 KiB <<< . (xfs /dev/vdb): request=9 time=1.06 ms 4 KiB <<< . (xfs /dev/vdb): request=10 time=1.10 ms 4 KiB <<< . (xfs /dev/vdb): request=11 time=931.1 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=12 time=1.12 ms 4 KiB <<< . (xfs /dev/vdb): request=13 time=1.00 ms 4 KiB <<< . (xfs /dev/vdb): request=14 time=931.1 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=15 time=1.01 ms 4 KiB <<< . (xfs /dev/vdb): request=16 time=1.17 ms (slow) 4 KiB <<< . (xfs /dev/vdb): request=17 time=1.14 ms 4 KiB <<< . (xfs /dev/vdb): request=18 time=968.0 us ^C --- . 
(xfs /dev/vdb) ioping statistics --- 17 requests completed in 17.6 ms, 68 KiB read, 964 iops, 3.77 MiB/s generated 18 requests in 17.1 s, 72 KiB, 1 iops, 4.20 KiB/s min/avg/max/mdev = 931.1 us / 1.04 ms / 1.17 ms / 81.2 us ~~~~~~~~ ~~~~~~~~~ Finally Did Shutdown & test result was bit close ~~~ [root@vm1 test]# shutdown -h now Connection to 10.0.0.9 closed by remote host. Connection to 10.0.0.9 closed. [root@overcloud-controller-0 ~]# [root@overcloud-controller-0 ~]# [root@overcloud-controller-0 ~]# ip netns exec qdhcp-68a81ea2-e5aa-43ee-be7d-a0eabd60672b ssh -i mykey.pem cloud-user.0.9 Last login: Wed Jul 19 03:19:44 2017 from host-10-0-0-2.openstacklocal [cloud-user@vm1 ~]$ [cloud-user@vm1 ~]$ sudo su - Last login: Wed Jul 19 03:19:48 EDT 2017 on pts/0 [root@vm1 ~]# [root@vm1 ~]# mount /dev/vdb /test [root@vm1 ~]# cd /test [root@vm1 test]# ioping . 4 KiB <<< . (xfs /dev/vdb): request=1 time=148.6 us (warmup) 4 KiB <<< . (xfs /dev/vdb): request=2 time=264.8 us 4 KiB <<< . (xfs /dev/vdb): request=3 time=270.3 us 4 KiB <<< . (xfs /dev/vdb): request=4 time=226.8 us 4 KiB <<< . (xfs /dev/vdb): request=5 time=180.9 us 4 KiB <<< . (xfs /dev/vdb): request=6 time=252.5 us 4 KiB <<< . (xfs /dev/vdb): request=7 time=179.3 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=8 time=222.9 us 4 KiB <<< . (xfs /dev/vdb): request=9 time=296.7 us (slow) 4 KiB <<< . (xfs /dev/vdb): request=10 time=200.9 us 4 KiB <<< . (xfs /dev/vdb): request=11 time=179.4 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=12 time=202.0 us 4 KiB <<< . (xfs /dev/vdb): request=13 time=192.8 us 4 KiB <<< . (xfs /dev/vdb): request=14 time=199.6 us 4 KiB <<< . (xfs /dev/vdb): request=15 time=295.5 us (slow) 4 KiB <<< . (xfs /dev/vdb): request=16 time=198.0 us 4 KiB <<< . (xfs /dev/vdb): request=17 time=178.0 us (fast) 4 KiB <<< . (xfs /dev/vdb): request=18 time=185.5 us ^C --- . (xfs /dev/vdb) ioping statistics --- 17 requests completed in 3.73 ms, 68 KiB read, 4.56 k iops, 17.8 MiB/s generated 18 requests in 17.1 s, 72 KiB, 1 iops, 4.21 KiB/s min/avg/max/mdev = 178.0 us / 219.2 us / 296.7 us / 40.1 us ************************
(In reply to Anil Dhingra from comment #19)
> tried to replicate this is in physical lab 1Ctrl + 2 compute + 3 ceph node
> with 1 Os disk + 1 osd disk & results were same after shutdown / power on VM
> result was better but without shutdown
> its was almost 75% degraded . Let me try do perf test also in LAB
>
> If you need setup details please do let me know - i can keep this setup for
> 1-2 days .

Feel free to do the tests in your lab and see if you can reproduce the problem, and upload the `perf` results somewhere. (Just a gentle note: if you have super long outputs, attach them as plain text files.)

I have my own (virtual) setup where I am trying to replicate this problem, with 3 Controllers, 2 Computes, and 3 Ceph nodes. I am done with the first two `perf` tests and have the data locally:

- [Pre-migration] On the source Compute, with RBD cache enabled
- [Pre-migration] On the source Compute, with RBD cache *disabled*

I am yet to do the migration and the next two `perf` tests:

- [Post-migration] On the destination Compute, with RBD cache enabled
- [Post-migration] On the destination Compute, with RBD cache *disabled*

Then I will parse the data through a script I wrote to calculate the latencies between the 'rbd_aio_read' and 'rbd_aio_release' requests, and I will need a Ceph / QEMU I/O expert to help me interpret it.

[...]
Created attachment 1300975 [details] perf test
Created attachment 1300978 [details] perf test post
Tests done with rbd cache enabled; files attached.

Pre
====
[root@vm1 test]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 1.92778 s, 13.3 MB/s

Post migration (took much longer)
======
[root@vm1 test]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 29.2381 s, 876 kB/s

Source
======
[root@overcloud-compute-0 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 29 times to write data ]
[ perf record: Captured and wrote 8.101 MB perf.data (100000 samples) ]
[root@overcloud-compute-0 ~]# mv perf.data perf.data-with-cache-pre

Destination
==========
[root@overcloud-compute-1 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 8.190 MB perf.data (100012 samples) ]
[root@overcloud-compute-1 ~]# mv perf.data perf.data-with-cache-post

~~ Will disable the RBD cache now and do a fresh test. ~~
Created attachment 1301093 [details] perf test without cache pre
Created attachment 1301094 [details] perf test without cache post
Below: test with rbd cache disabled
======

[heat-admin@overcloud-compute-1 ~]$ uptime
 12:00:36 up 4 min,  1 user,  load average: 0.03, 0.15, 0.08
[heat-admin@overcloud-compute-1 ~]$ tail -2 /etc/ceph/ceph.conf
[client]
rbd_cache = false
[heat-admin@overcloud-compute-1 ~]$

Pre
~~~~
[root@overcloud-compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 1     instance-00000004              running

[root@overcloud-compute-1 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 8.066 MB perf.data (100000 samples) ]
[root@overcloud-compute-1 ~]# mv perf.data perf.data-without-cache-pre

[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 1.96597 s, 13.0 MB/s

*********

[root@overcloud-compute-0 ~]# uptime
 12:00:24 up 4 min,  1 user,  load average: 0.02, 0.11, 0.05
[root@overcloud-compute-0 ~]# tail -2 /etc/ceph/ceph.conf
[client]
rbd_cache = false

Post
~~~~
[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 26.9762 s, 949 kB/s
[root@vm1 ~]#

[root@overcloud-compute-0 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 26 times to write data ]
[ perf record: Captured and wrote 8.127 MB perf.data (100000 samples) ]
[root@overcloud-compute-0 ~]# mv perf.data perf.data-without-cache-post
I have calculated the following librbd read request latencies from Anil's perf data. The dd mean latency is also included (calculated from Anil's dd output) for comparison; it shows that the perf results are sane. The dd latency is the librbd latency plus QEMU and guest kernel latency.

  Cache  Pre/Post  Median  Mean  dd
                   (us)    (us)  (us)
  --------------------------------------
  On     Pre       12      13    39
  On     Post      494     501   585
  Off    Pre       12      13    39
  Off    Post      461     472   539

Two observations:

1. librbd is much slower after migration. This indicates that the issue is at the Ceph level, not in QEMU or the guest.

2. Cache on/off makes no difference. This is odd: the block size is 512 bytes and only 26 MB are read, so cache on should perform well while cache off should be slower.
Created attachment 1301289 [details]
Latency analysis script

Example usage:

$ perf script -i perf.data -s perf-latency.py
Think Time
Mean: 0.000025
Median: 0.000025
Min: 0.000023
Max: 0.000147
Histogram bucket size: 0.000112
0.000079: 49998
0.000191: 1
0.000304: 0
0.000416: 0
0.000529: 0
0.000641: 0
0.000754: 0
0.000866: 0
0.000979: 0
0.001091: 0

Read Latency
Mean: 0.000013
Median: 0.000012
Min: 0.000011
Max: 0.000521
Histogram bucket size: 0.000151
0.000087: 49997
0.000238: 1
0.000389: 1
0.000540: 1
0.000691: 0
0.000842: 0
0.000993: 0
0.001144: 0
0.001295: 0
0.001446: 0
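For readers without access to the attachment, here is a rough standalone sketch of the same idea. It post-processes textual `perf script` output rather than using perf's built-in Python scripting, the expected line format ("... <timestamp>: probe_librbd:<event>: ...") is an assumption, and the FIFO read/release pairing is a simplification, so treat it as illustrative only -- the attached perf-latency.py is the authoritative version.

#!/usr/bin/env python3
# Rough sketch (not the attached perf-latency.py): compute read request
# latencies from textual `perf script` output by pairing each rbd_aio_read
# event with the next rbd_aio_release event (FIFO pairing, which assumes
# in-order completion).
import re
import statistics
import sys

EVENT_RE = re.compile(r"\s(\d+\.\d+):\s+probe_librbd:(rbd_aio_\w+):")

def read_latencies(stream):
    pending = []   # submission timestamps of reads awaiting completion
    deltas = []
    for line in stream:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts, event = float(m.group(1)), m.group(2)
        if event == "rbd_aio_read":
            pending.append(ts)
        elif event == "rbd_aio_release" and pending:
            deltas.append(ts - pending.pop(0))
    return deltas

if __name__ == "__main__":
    d = read_latencies(sys.stdin)
    if not d:
        sys.exit("no probe_librbd events found")
    print("requests: %d" % len(d))
    print("mean:   %.6f s" % statistics.mean(d))
    print("median: %.6f s" % statistics.median(d))
    print("min/max: %.6f / %.6f s" % (min(d), max(d)))

Usage would be along the lines of `perf script -i perf.data-with-cache-pre | python3 read-latency.py` (the script name is just a placeholder).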
Hi Jason,

Given Stefan's observations about the `perf` data for the four tests (noted in earlier comments) in comment#28 and comment#29, the problem appears to be at the Ceph level. Do you, or other Ceph team members, have any advice here?
(In reply to Kashyap Chamarthy from comment #31)
> Hi Jason,
>
> Given Stefan's observations about the `perf` data for the four tests
> (noted in earlier comments) in comment#28 and comment#29, the problem
> appears to be at the Ceph level. Do you, or other Ceph team members,
> have any advice here?

At the moment we cannot explicitly say that the problem is at the Ceph level. We would need to run the same tests against another backend.
Created attachment 1301672 [details]
Perf test results for the four test cases (pre- and post-migration, with and without RBD cache enabled)

The results are from running the raw `perf` data through Stefan's script perf-latency.py from comment#29. So the attached text file contains the results of the four data sets run through:

[compute-0]$ perf script -i pre-mig-src-cache-yes.data -s perf-latency.py
[compute-0]$ perf script -i pre-mig-src-cache-no.data -s perf-latency.py
[compute-1]$ perf script -i post-mig-src-cache-no.data -s perf-latency.py
[compute-1]$ perf script -i post-mig-src-cache-yes.data -s perf-latency.py
(In reply to Shinobu KINJO from comment #32)
> (In reply to Kashyap Chamarthy from comment #31)
> > Hi Jason,
> >
> > Given Stefan's observations about the `perf` data for the four tests
> > (noted in earlier comments) in comment#28 and comment#29, the problem
> > appears to be at the Ceph level. Do you, or other Ceph team members,
> > have any advice here?
>
> At the moment we cannot explicitly say that the problem is at the Ceph
> level. We would need to run the same tests against another backend.

The performance degradation observed by Anil *is* at the Ceph level. The results in comment #28 show that librbd's read latency increases after migration. Once the guest and QEMU submit a request to librbd, they are no longer involved until that request completes; this is what Anil measured.

Therefore the next step is to focus on librbd and Ceph to see why read requests take longer after migration. It's the next logical place to investigate because the results show that the latency degradation does not happen in the guest or QEMU.
(In reply to Kashyap Chamarthy from comment #34)
> Created attachment 1301672 [details]
> Perf test results for the four test cases (pre- and post-migration, with
> and without RBD cache enabled)
>
> The results are from running the raw `perf` data through Stefan's script
> perf-latency.py from comment#29. So the attached text file contains the
> results of the four data sets run through:
>
> [compute-0]$ perf script -i pre-mig-src-cache-yes.data -s perf-latency.py
> [compute-0]$ perf script -i pre-mig-src-cache-no.data -s perf-latency.py
> [compute-1]$ perf script -i post-mig-src-cache-no.data -s perf-latency.py
> [compute-1]$ perf script -i post-mig-src-cache-yes.data -s perf-latency.py

Here is a summary of what you've posted:

  Cache  Pre/Post  Median  Mean  dd
                   (us)    (us)  (us)
  --------------------------------------
  On     Pre       695     725   ?
  On     Post      775     802   ?
  Off    Pre       667     712   ?
  Off    Post      48      56    ?

Similar to Anil's results, there is no performance difference between cache on/off pre-migration. This seems unusual for a sequential read test.

The post-migration result with cache disabled is bizarre: performance was much better *after* migration?

Kashyap: please post the disk I/O benchmark output from these runs so these numbers can be sanity-checked. It's clear from the number of I/O requests that you ran a different workload from Anil, but the details are important.

At this point I would say your results do not show a performance degradation in librbd after live migration.
After migration, the rbd cache for the Cinder volume is disabled: *PRE-MIGRATION* qemu 138779 1 99 14:42 ? 00:00:01 /usr/libexec/qemu-kvm -name guest=instance-00000004,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-12-instance-00000004/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 610b4ffd-5b36-467e-9f59-a718f2adc674 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.6-2.el7ost,serial=38873f97-1eae-4dc9-a74a-71b078bb59c9,uuid=610b4ffd-5b36-467e-9f59-a718f2adc674,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-12-instance-00000004/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=hHm8kz+svBrnCmXp4oeuXFWx15+mSpKtLMKkzoTij2Y=,keyid=masterKey0,iv=q2IMnNhkk673f3rQYNEIeg==,format=base64 -drive file=rbd:vms/610b4ffd-5b36-467e-9f59-a718f2adc674_disk:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -object secret,id=virtio-disk1-secret0,data=FYWo2AMkPmp5BGS6BMQn6LBvp6IIuamaaA5vCKx/sTU=,keyid=masterKey0,iv=YoJFXG+EDhGSCKwWVwxrJA==,format=base64 -drive file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:18:0d:aa,bus=pci.0,addr=0x3 -add-fd set=2,fd=31 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.74.160.8:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on ## -drive file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=writeback ** POST-MIGRATION ** qemu 135842 1 1 14:09 ? 
00:00:30 /usr/libexec/qemu-kvm -name guest=instance-00000004,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-11-instance-00000004/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 610b4ffd-5b36-467e-9f59-a718f2adc674 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.6-2.el7ost,serial=38873f97-1eae-4dc9-a74a-71b078bb59c9,uuid=610b4ffd-5b36-467e-9f59-a718f2adc674,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-11-instance-00000004/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=5mU5udvfMsCiRB8giamW+f5ECusfzCXHn+jwE0c5ldI=,keyid=masterKey0,iv=vOlnzyPWhQk2Nm1URL0Saw==,format=base64 -drive file=rbd:vms/610b4ffd-5b36-467e-9f59-a718f2adc674_disk:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -object secret,id=virtio-disk1-secret0,data=mAx00CIXnxK7DoOaQfor6yah3m2HMCVkT8eIKIvmxIw=,keyid=masterKey0,iv=ICdwnOp9R0AwNzKL8BWf+Q==,format=base64 -drive file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:18:0d:aa,bus=pci.0,addr=0x3 -add-fd set=2,fd=32 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.74.160.8:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on ## -drive file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=none Looks like perhaps libvirt/QEMU/Nova is dropping the Cinder cache setting (note that the Nova ephemeral volume retains its writeback cache setting after migration).
I also enabled Ceph admin sockets and logging, and confirmed that the cache was enabled pre-migration and disabled post-migration on the Cinder volume:

** PRE-MIGRATION **

# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.138779.140668137822720.asok config get rbd_cache
{
    "rbd_cache": "true"
}

** POST-MIGRATION **

# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.135842.139761430051328.asok config get rbd_cache
{
    "rbd_cache": "false"
}
Nice catch, Jason!

From a chat with Daniel Berrangé on IRC, here is the Nova perspective:

    "Prior to live migration, Nova rewrites all the <disk> elements, and
    passes this updated guest XML across to the target libvirt. And it is
    never calling _set_disk_cache() when doing this. So `nova.conf`'s
    `writeback` setting is getting lost, leaving the default `cache=none`
    setting. And this mistake (of leaving the cache value at the default
    'none') will of course be corrected when you reboot the guest on the
    target later."

So we need to set the cache mode in the _get_volume_config() method in Nova, because it is the callback function used by _update_volume_xml() in nova/virt/libvirt/migration.py. To that effect, here's a Nova upstream change:

    https://review.openstack.org/#/c/485752/ -- libvirt/driver.py: Set cache value for Cinder volume post-migration
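For context, a sketch of the gist of that change, paraphrased from the description above (this is not the exact upstream diff, and the helper name _set_cache_mode() is an assumption):

# Sketch only, not the merged patch.  The idea, per the analysis above:
# _get_volume_config() is the callback used by _update_volume_xml() when
# Nova rewrites the <disk> elements for live migration, so applying the
# configured cache mode here keeps the Cinder volume's cache=writeback
# setting on the destination host.
def _get_volume_config(self, connection_info, disk_info):
    vol_driver = self._get_volume_driver(connection_info)
    conf = vol_driver.get_config(connection_info, disk_info)
    self._set_cache_mode(conf)  # assumed helper applying nova.conf's disk_cachemodes
    return conf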
(In reply to Jason Dillaman from comment #42)
> I also enabled Ceph admin sockets and logging, and confirmed that the
> cache was enabled pre-migration and disabled post-migration on the Cinder
> volume:
>
> ** PRE-MIGRATION **
>
> # ceph --admin-daemon /var/run/qemu/ceph-client.openstack.138779.140668137822720.asok config get rbd_cache
> {
>     "rbd_cache": "true"
> }
>
> ** POST-MIGRATION **
>
> # ceph --admin-daemon /var/run/qemu/ceph-client.openstack.135842.139761430051328.asok config get rbd_cache
> {
>     "rbd_cache": "false"
> }

That's a good catch. But the difference in the <rbd_cache> setting does not contribute to the performance degradation. The following is a test case with <rbd_cache> explicitly enabled on both the source and destination hosts.

// *** SOURCE HOST ***

// *** Condition of <rbd_cache> ***
[root@overcloud-compute-0 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.184066.140268954968064.asok config get rbd_cache
{
    "rbd_cache": "true"
}

// *** DO DD ***
[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=5000
5000+0 records in
5000+0 records out
2560000 bytes (2.6 MB) copied, 0.191401 s, 13.4 MB/s

// *** DO MIGRATION ***

// *** DESTINATION ***

// *** Condition of <rbd_cache> ***
[root@overcloud-compute-1 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.181268.139699450011648.asok config get rbd_cache
{
    "rbd_cache": "true"
}

// *** DO DD ***
[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=5000
5000+0 records in
5000+0 records out
2560000 bytes (2.6 MB) copied, 3.32248 s, 771 kB/s
@Shinobu: how did you "explicitly" enable the cache? The value in "ceph.conf" is overridden by the QEMU setting. Also, are you 100% positive you are using the correct ASOK file? There should be two files present (one for the Nova ephemeral volume and the other for Cinder). Also note that running "config set rbd_cache true" doesn't dynamically enable the cache but does change the configuration setting.
(In reply to Jason Dillaman from comment #45)
> @Shinobu: how did you "explicitly" enable the cache? The value in
> "ceph.conf" is overridden by the QEMU setting. Also, are you 100% positive
> you are using the correct ASOK file? There should be two files present (one
> for the Nova ephemeral volume and the other for Cinder). Also note that
> running "config set rbd_cache true" doesn't dynamically enable the cache but
> does change the configuration setting.

If I were using an inappropriate socket file, I should not be able to get the config value at all, should I? Is something wrong with what I saw?

[root@overcloud-compute-0 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.184066.140268954968064.asok config get rbd_cache
{
    "rbd_cache": "true"
}

[root@overcloud-compute-1 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.181268.139699450011648.asok config get rbd_cache
{
    "rbd_cache": "true"
}

// ******

[root@overcloud-compute-0 ~]# cat /etc/ceph/ceph.conf | grep cache
rbd_cache = true

[root@overcloud-compute-1 ~]# cat /etc/ceph/ceph.conf | grep cache
rbd_cache = true
@Shinobu: you didn't really answer my question. There are two ASOK files per instance of qemu since it has two disks attached. You checked two ASOK files on two different hosts. Also, as I mentioned, the QEMU cache setting overrides what you have in your ceph.conf, so grepping that file doesn't imply anything. Regardless, I also looked at the raw logs and saw that the actual IO latency between librbd and the OSD was consistent between pre- and post- migration. That is why I looked into the cache settings and confirmed that the Cinder cache was being disabled upon migration. I recommend a re-test with the proposed fix applied to Nova.
(In reply to Jason Dillaman from comment #49)
> @Shinobu: you didn't really answer my question. There are two ASOK files per
> instance of qemu since it has two disks attached. You checked two ASOK files
> on two different hosts. Also, as I mentioned, the QEMU cache setting
> overrides what you have in your ceph.conf, so grepping that file doesn't
> imply anything.
>
> Regardless, I also looked at the raw logs and saw that the actual IO latency
> between librbd and the OSD was consistent between pre- and post- migration.
> That is why I looked into the cache settings and confirmed that the Cinder
> cache was being disabled upon migration. I recommend a re-test with the
> proposed fix applied to Nova.

That makes sense.
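For anyone repeating the admin-socket check, a hypothetical helper along these lines queries every librbd client socket on a compute node, so both the Nova ephemeral disk and the Cinder volume sockets get checked (the socket directory is assumed from the ASOK paths shown earlier in this bug):

#!/usr/bin/env python3
# Hypothetical helper (not part of any shipped tooling): print the rbd_cache
# setting reported by every librbd client admin socket on this host.
import glob
import json
import subprocess

for sock in sorted(glob.glob("/var/run/qemu/ceph-client.*.asok")):
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", sock, "config", "get", "rbd_cache"])
    print("%s: rbd_cache=%s" % (sock, json.loads(out)["rbd_cache"]))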
The patch for Git master is merged. Here's the upstream stable/newton backport, currently in progress: https://review.openstack.org/#/c/488959/
Verification notes for this bug:

*Without* the bug fix (which ships in openstack-nova-14.0.8-2.el7ost), when you live-migrate a Nova instance with a Cinder volume attached -- where both the Nova instance's disk and the Cinder volume are on Ceph -- the cache value for the Cinder volume erroneously changes from 'writeback' to 'none'.

[Check by doing `ps -ef | grep qemu` and looking at the relevant QEMU process associated with the Nova instance.]

Pre-migration, QEMU command line for the Nova instance:

    [...] -drive file=rbd:volumes/volume-[...],cache=writeback

Post-migration, QEMU command line for the Nova instance:

    [...] -drive file=rbd:volumes/volume-[...],cache=none

*With* the bug fix (from openstack-nova-14.0.8-2.el7ost), the cache value for the Cinder volume should remain 'writeback':

Pre-migration, QEMU command line for the Nova instance:

    [...] -drive file=rbd:volumes/volume-[...],cache=writeback

Post-migration, QEMU command line for the Nova instance:

    [...] -drive file=rbd:volumes/volume-[...],cache=writeback
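As an automated variant of the `ps` check above, a hypothetical helper could report the cache mode of each RBD-backed drive of a running QEMU process; the -drive argument format is taken from the command lines pasted earlier in this bug, and the helper itself is illustrative only:

#!/usr/bin/env python3
# Hypothetical verification helper: report the cache mode of every rbd-backed
# -drive on a running QEMU process by reading its command line from /proc.
import re
import sys

def rbd_cache_modes(qemu_pid):
    with open("/proc/%s/cmdline" % qemu_pid) as f:
        args = f.read().split("\0")
    modes = {}
    for arg in args:
        if arg.startswith("file=rbd:"):
            image = arg.split(":", 2)[1]          # e.g. volumes/volume-...
            m = re.search(r"cache=(\w+)", arg)
            modes[image] = m.group(1) if m else "default"
    return modes

if __name__ == "__main__":
    # PID taken from `ps -ef | grep qemu` on the compute node.
    print(rbd_cache_modes(sys.argv[1]))

Running it against the instance's QEMU PID before and after the migration should show 'writeback' for the volumes/... image in both cases once the fix is applied.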
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2823