Bug 1463897 - I/O latency of cinder volume after live migration increases
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 10.0 (Newton)
Hardware: Unspecified  OS: Unspecified
Priority: high  Severity: high
Target Milestone: z5
Target Release: 10.0 (Newton)
Assigned To: Kashyap Chamarthy
QA Contact: Gabriel Szasz
Keywords: Triaged, ZStream
Depends On: 1482921
Reported: 2017-06-21 23:53 EDT by Anil Dhingra
Modified: 2017-10-28 22:45 EDT
CC: 23 users

Fixed In Version: openstack-nova-14.0.8-2.el7ost
Clones: 1482921
Last Closed: 2017-09-28 12:32:44 EDT
Type: Bug


Attachments
perf test (8.12 MB, application/octet-stream)
2017-07-19 07:16 EDT, Anil Dhingra
perf test post (8.21 MB, application/octet-stream)
2017-07-19 07:18 EDT, Anil Dhingra
perf test without cache pre (8.08 MB, application/octet-stream)
2017-07-19 09:27 EDT, Anil Dhingra
perf test without cache post (8.15 MB, application/octet-stream)
2017-07-19 09:28 EDT, Anil Dhingra
Latency analysis script (2.95 KB, text/plain)
2017-07-19 14:14 EDT, Stefan Hajnoczi
Perf test results for the four test cases (pre- and post-migration, with and without RBD cache enabled) (6.40 KB, text/plain)
2017-07-20 07:49 EDT, Kashyap Chamarthy


External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1706083 None None None 2017-07-24 10:04 EDT
OpenStack gerrit 485752 None None None 2017-07-28 04:04 EDT
OpenStack gerrit 488959 None None None 2017-07-29 11:21 EDT

Description Anil Dhingra 2017-06-21 23:53:53 EDT
Description of problem:

The I/O latency of a Cinder volume increases significantly after live migration of the instance it's attached to, and stays elevated until the VM is stopped and started again. (The VM is booted from the Cinder volume.)

This is not the case when using a disk from the Nova store backend (no Cinder volume attached); at least the difference after a live migration is not nearly as large.

The storage backend is Ceph 2.0.

Version-Release number of selected component (if applicable):

 


How reproducible:


Steps to Reproduce:
1. Create a VM booted from a Cinder volume and live-migrate it.
2. Check I/O latency with `ioping` before and after migration.

Actual results:


Expected results:


Additional info:

Here is some captured data for reference:

ioping output on cinder vol attached to instance before live migration:

[cloud-user@test-virt-scsi ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=98.0 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=135.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=155.5 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=161.7 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=148.4 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=354.3 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=138.0 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=8 time=150.7 us
4 KiB <<< . (xfs /dev/sda1): request=9 time=149.6 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=138.6 us (fast)

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 1.53 ms, 36 KiB read, 5.87 k iops, 22.9 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 135.6 us / 170.3 us / 354.3 us / 65.6 us


After live migration: 

[cloud-user@test-virt-scsi ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=1.03 ms (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=948.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=955.7 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=920.5 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=1.03 ms
4 KiB <<< . (xfs /dev/sda1): request=6 time=838.2 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=1.13 ms (slow)
4 KiB <<< . (xfs /dev/sda1): request=8 time=868.6 us
4 KiB <<< . (xfs /dev/sda1): request=9 time=985.2 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=936.6 us

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 8.61 ms, 36 KiB read, 1.04 k iops, 4.08 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 838.2 us / 956.9 us / 1.13 ms / 81.0 us

This goes back to an average of 200us again after shutting down and starting up the instance. 

This is not the case with a disk from the Nova store backend (or at least the difference after a live migration isn't nearly as large).

Before migration: 

[cloud-user@test-no-vol ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=103.6 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=236.6 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=284.2 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=163.3 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=168.9 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=221.9 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=182.1 us
4 KiB <<< . (xfs /dev/sda1): request=8 time=154.0 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=9 time=179.8 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=174.4 us

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 1.77 ms, 36 KiB read, 5.10 k iops, 19.9 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 154.0 us / 196.1 us / 284.2 us / 40.3 us

After migration: 

[cloud-user@test-no-vol ~]$ sudo ioping -c 10 .
4 KiB <<< . (xfs /dev/sda1): request=1 time=198.1 us (warmup)
4 KiB <<< . (xfs /dev/sda1): request=2 time=229.3 us
4 KiB <<< . (xfs /dev/sda1): request=3 time=253.2 us
4 KiB <<< . (xfs /dev/sda1): request=4 time=230.5 us
4 KiB <<< . (xfs /dev/sda1): request=5 time=443.3 us
4 KiB <<< . (xfs /dev/sda1): request=6 time=390.6 us
4 KiB <<< . (xfs /dev/sda1): request=7 time=249.4 us
4 KiB <<< . (xfs /dev/sda1): request=8 time=231.3 us (fast)
4 KiB <<< . (xfs /dev/sda1): request=9 time=247.9 us
4 KiB <<< . (xfs /dev/sda1): request=10 time=238.2 us (fast)

--- . (xfs /dev/sda1) ioping statistics ---
9 requests completed in 2.51 ms, 36 KiB read, 3.58 k iops, 14.0 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 229.3 us / 279.3 us / 443.3 us / 75.1 us
----------------- 
The storage backend for both Cinder and Nova is Ceph 2.x. The setup is not under any load yet, so there is no noisy-neighbour effect on the compute node after live migration. It is also not related to a specific compute node's network performance, since latency returns to ~200 us after the VM is shut down and started.

After repeated tests this can be consistently reproduced: the Cinder volume I/O latency increases significantly after live migration and does not decrease until the instance is shut off and started again.
Comment 3 Gorka Eguileor 2017-06-22 04:27:31 EDT
Reassigning to Nova since this is unrelated to Cinder.  With the RBD backend the Cinder service has no effect on live migration, since both the attach/detach and initialize/terminate connection operations are basically no-ops.

I think this could be related to a time delta between the source and destination hosts, because a TSC difference can make a live-migrated guest sluggish or even hang.
Comment 6 Kashyap Chamarthy 2017-06-27 10:46:28 EDT
Trying to wrap my head around what precisely is causing the I/O latency
after migration.  Here's the report of my last couple of hours of
investigating this issue with QEMU developers (thanks: Stefan Hajnoczi,
Dave Gilbert, and Dan Berrange).  I will split it into two comments; the
next comment will contain potential tests we'd like done to better
understand the problem.

  - I see from the 'sosreports' that all the disks are RBD
    volumes:

    <disk type="network" device="disk">
      <driver name="qemu" type="raw" cache="writeback" discard="unmap"/>
        <source protocol="rbd"
        name="volumes/volume-5a815515-339d-4539-8ab9-415b662232f7">

    Please correct me if that's not the case.

  - Next, please confirm that the source and destination hosts are
    synchronized with respect to wall-clock time, for the following
    reason:

        Quoting a post[1] written by an ex-Red Hatter who was a QEMU live
        migration developer:

        "The wall-clock times on both the servers must match.  NTP helps
        with this, but configuring the right time zones and ensuring the
        times are the same on both the hosts are essential.  We have had
        several bug reports where migration initially succeeds, but later
        the guest becomes sluggish or stops running entirely; and it was
        quite difficult to zero down the problem to the time delta on the
        hosts.  Windows guests are especially vulnerable to this source of
        misconfiguration."

       [1]
       https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/

  - I also checked on the Ceph upstream channel to rule out a
    Ceph-side issue:

        [Kashyap] Is there a performance hit to be expected when two clients
        access a file/volume at the same time?  In this case the said
        'client' is QEMU, during live migration.  Maybe this forces Ceph to
        route the request to ensure data consistency or something like that.
    
        [Peetaur2] Live migration of RAM and VM, but same storage, yes?
    
        [Kashyap] Yes. 
    
        [Peetaur2] In those cases I wouldn't expect any noticeable effect,
        unless you count the 10ms or whatever when it finishes and pauses
        for a short time
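A quick way to check the wall-clock sync mentioned above -- a rough sketch only; "dest-host" is a placeholder, the ssh round-trip adds some error, and `timedatectl` output varies slightly by release:

```shell
# Rough check that source and destination hosts agree on wall-clock time.
# Run from the source host; "dest-host" is a placeholder.
t_src=$(date +%s%N)
t_dst=$(ssh dest-host date +%s%N)
awk -v a="$t_src" -v b="$t_dst" 'BEGIN { printf "clock delta: %.3f s\n", (b - a) / 1e9 }'

# NTP sync state on each host:
timedatectl | grep -i 'synchronized'
```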
Comment 7 Kashyap Chamarthy 2017-06-27 11:09:46 EDT
Tests to understand the behavior more closely
---------------------------------------------

[Noted in increasing complexity.]

(1) Examine the QEMU I/O request counters.  One can see how many
    requests have been processed.

    Run: 
    
        $ virsh domblkstat nova-instance-1
        
    On source (before migration) and destination (after migration) host
    -- then post that somewhere in a plain text file, to examine the I/O
    data.

    Stefan Hajnoczi notes: 
    
      This approach is only reliable if 1) the guest is idle besides
      running the benchmark 2) they know what workload the benchmark is
      generating.
    
      If `ioping` only generates 1 write per second, then it could get
      lost in noise (e.g. guest kernel doing read-ahead).  But if you
      know to expect a significant I/O pattern, then it could be a
      useful way of checking for disk accesses.
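As a rough way to read these counters: the mean write latency between two `domblkstat` samples can be estimated as the delta of `wr_total_times` (nanoseconds) divided by the delta of `wr_req`.  A minimal awk sketch, using two of the pre-migration Cinder samples captured later in this bug -- note the result mixes in queueing and flush effects, so treat it only as a rough indicator:

```shell
# Estimate mean write latency from two consecutive virsh domblkstat samples.
# The counter values below are copied from this bug's pre-migration data;
# wr_total_times is accumulated in nanoseconds.
awk 'BEGIN {
    d_req  = 2181619 - 2179025          # write requests completed between samples
    d_time = 62239063054 - 57818127850  # write time accumulated between samples (ns)
    printf "avg write latency: %.1f us\n", d_time / d_req / 1000
}'
```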


(2) It is useful to repeat the `ioping` benchmark on a RAM disk and see
    if it hits the same problem, because this test takes any actual
    I/O out of the picture.  If you still see the latency, the issue
    has nothing to do with I/O and lies somewhere else.  (Thanks to
    Dave Gilbert for this suggestion.)


(3) Using `fio` with 'direct=1' (which uses non-buffered I/O) is a good
    sanity check if you want a second opinion regarding latency.
    Here's a config and how to run it (after installing the `fio` RPM):

    Prepare this config file:

        $ cat rand-write.fio
        [writes]
        rw=randwrite
        direct=1
        ioengine=libaio
        bs=512
        filename=/export/fio-rand-write
        time_based
        runtime=30
        size=4K

    And run it as:

        $ fio rand-write.fio


(4) Tracing QEMU disk I/O with SystemTap.  If the first test doesn't
    yield any useful detail, we may have to instrument further to trace
    QEMU I/O.  Stefan Hajnoczi suggests that it is possible to trace
    QEMU disk I/O using the following 'read', 'write', and 'completion'
    probes:

        qemu.system.x86_64.virtio_blk_handle_read
        qemu.system.x86_64.virtio_blk_handle_write
        qemu.system.x86_64.virtio_blk_rw_complete

    These probes are defined in
    /usr/share/systemtap/tapset/qemu-system-x86_64.stp
   
    Example syntax (not to be taken literally -- we need to use the
    above-mentioned 'virtio_blk' trace events):

        probe qemu2.system.x86_64.blk_co_preadv
        {
                   printf("blk=%p,bs=%p,offset=%lu,bytes=%u,flags=%#x\n",
                                 blk, bs, offset, bytes, flags);
        }
    
    Its invocation is:

        $ stap myscript.stp

    It should find the running QEMU process and print to stdout.

    NB: The tracepoints are in QEMU so `stap` needs to be run on the
        host.
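    As a sketch of what such a script might look like (assumptions: the
    probe argument name `req` follows QEMU's trace-event definitions and
    the tapset name matches the installed QEMU; this is not Stefan's
    attached script):

```systemtap
# Rough per-request write latency, keyed on the request pointer.
global start

probe qemu.system.x86_64.virtio_blk_handle_write {
    start[req] = gettimeofday_ns()
}

probe qemu.system.x86_64.virtio_blk_rw_complete {
    if (req in start) {
        printf("write latency: %d ns\n", gettimeofday_ns() - start[req])
        delete start[req]
    }
}
```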
Comment 8 Anil Dhingra 2017-06-28 04:53:00 EDT
Thanks for the update.

- Yes, all the disks are RBD.

- Time is already in sync; see the verified output in comment 4.

- I will ask the consultant to run the requested tests, but it may take time, as the [vendor] onsite consulting hours are over.
Comment 9 Anil Dhingra 2017-06-28 07:45:55 EDT
Please find below the results available in this case as well.
=======================



Results inline:

> (1) Examine the QEMU I/O request counters.  One can see how many
>     requests have been processed.
> 
>     Run:
> 
>         $ virsh domblkstat nova-instance-1
> 
>     On source (before migration) and destination (after migration) host
>     -- then post that somewhere in a plain text file, to examine the I/O
>     data.
> 
>     Stefan Hajnoczi notes:
> 
>       This approach is only reliable if 1) the guest is idle besides
>      running the benchmark 2) they know what workload the benchmark is
>       generating.
> 
>       If `ioping` only generates 1 write per second, then it could get
>       lost in noise (e.g. guest kernel doing read-ahead).  But if you
>       know to expect a significant I/O pattern, then it could be a
>       useful way of checking for disk accesses.

Combining these results with the fio tests run later.
 
> (2) It is useful to repeat the `ioping` benchmark on a RAM disk and see
>     if it hits the same problem. Because, this test removes any actual
>     I/O out of the picture.  If you still see the latency, this test
>     will tell us, the issue has nothing to do with I/O, but somewhere
>     else.  (Thanks to Dave Gilbert for this suggestion.)

I mounted tmpfs at /srv to run the test.

I don't see any significant difference: before migration the average was 2-3 us, and after migration it is the same.

It's not an IOPS issue across the board, since a Nova disk (also on Ceph) doesn't show any significant latency; the problem is constrained to Cinder disks, whether a Cinder-based root disk (boot-from-volume) or a Cinder volume attached anywhere in the filesystem.

For reference, I'm also sharing comparisons of Nova- and Cinder-managed disks attached to the same server, to highlight the difference between them:

tmpfs mount and ioping test results, run before migration:

[root@cinder-test ~]# mount -t tmpfs -o size=500m tmpfs /srv/ 
[root@cinder-test ~]# df -h 
Filesystem      Size  Used Avail Use% Mounted on 
/dev/vda1        40G  965M   40G   3% / 
devtmpfs        1.9G     0  1.9G   0% /dev 
tmpfs           1.9G     0  1.9G   0% /dev/shm 
tmpfs           1.9G   17M  1.9G   1% /run 
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup 
tmpfs           380M     0  380M   0% /run/user/1000 
tmpfs           500M     0  500M   0% /srv 
[root@cinder-test ~]# cd /srv/


PRE-LIVE-MIGRATION RESULTS:

ioping test on ramdisk:

[root@cinder-test srv]# ioping -c 10 . 
4 KiB <<< . (tmpfs tmpfs): request=1 time=2.96 us (warmup) 
4 KiB <<< . (tmpfs tmpfs): request=2 time=3.85 us 
4 KiB <<< . (tmpfs tmpfs): request=3 time=2.29 us 
4 KiB <<< . (tmpfs tmpfs): request=4 time=2.54 us 
4 KiB <<< . (tmpfs tmpfs): request=5 time=2.10 us 
4 KiB <<< . (tmpfs tmpfs): request=6 time=2.57 us 
4 KiB <<< . (tmpfs tmpfs): request=7 time=2.14 us (fast) 
4 KiB <<< . (tmpfs tmpfs): request=8 time=2.35 us 
4 KiB <<< . (tmpfs tmpfs): request=9 time=2.29 us 
4 KiB <<< . (tmpfs tmpfs): request=10 time=2.20 us 

--- . (tmpfs tmpfs) ioping statistics --- 
9 requests completed in 22.3 us, 36 KiB read, 402.9 k iops, 1.54 GiB/s 
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s 
min/avg/max/mdev = 2.10 us / 2.48 us / 3.85 us / 507 ns

ioping test on nova disk (nova backend is ceph): 

[root@cinder-test ~]# ioping -c 10 .    
4 KiB <<< . (xfs /dev/vda1): request=1 time=80.9 us (warmup) 
4 KiB <<< . (xfs /dev/vda1): request=2 time=131.4 us 
4 KiB <<< . (xfs /dev/vda1): request=3 time=149.4 us 
4 KiB <<< . (xfs /dev/vda1): request=4 time=127.1 us 
4 KiB <<< . (xfs /dev/vda1): request=5 time=127.0 us 
4 KiB <<< . (xfs /dev/vda1): request=6 time=125.3 us 
4 KiB <<< . (xfs /dev/vda1): request=7 time=124.3 us (fast) 
4 KiB <<< . (xfs /dev/vda1): request=8 time=148.5 us (slow) 
4 KiB <<< . (xfs /dev/vda1): request=9 time=98.9 us (fast) 
4 KiB <<< . (xfs /dev/vda1): request=10 time=101.8 us 

--- . (xfs /dev/vda1) ioping statistics --- 
9 requests completed in 1.13 ms, 36 KiB read, 7.94 k iops, 31.0 MiB/s 
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s 
min/avg/max/mdev = 98.9 us / 125.9 us / 149.4 us / 16.4 us

ioping test on cinder-managed disk (cinder backend is ceph)

[root@cinder-test mnt]# ioping -c 10 . 
4 KiB <<< . (xfs /dev/vdb): request=1 time=73.6 us (warmup) 
4 KiB <<< . (xfs /dev/vdb): request=2 time=141.9 us 
4 KiB <<< . (xfs /dev/vdb): request=3 time=147.6 us 
4 KiB <<< . (xfs /dev/vdb): request=4 time=233.6 us 
4 KiB <<< . (xfs /dev/vdb): request=5 time=123.3 us 
4 KiB <<< . (xfs /dev/vdb): request=6 time=190.4 us 
4 KiB <<< . (xfs /dev/vdb): request=7 time=159.7 us 
4 KiB <<< . (xfs /dev/vdb): request=8 time=142.3 us 
4 KiB <<< . (xfs /dev/vdb): request=9 time=155.8 us 
4 KiB <<< . (xfs /dev/vdb): request=10 time=139.6 us 

--- . (xfs /dev/vdb) ioping statistics --- 
9 requests completed in 1.43 ms, 36 KiB read, 6.28 k iops, 24.5 MiB/s 
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s 
min/avg/max/mdev = 123.3 us / 159.4 us / 233.6 us / 31.5 us

================ 

POST-LIVE-MIGRATION RESULTS:

Ramdisk: 

[root@cinder-test srv]# ioping -c 10 .  
4 KiB <<< . (tmpfs tmpfs): request=1 time=2.13 us (warmup) 
4 KiB <<< . (tmpfs tmpfs): request=2 time=2.21 us 
4 KiB <<< . (tmpfs tmpfs): request=3 time=1.97 us 
4 KiB <<< . (tmpfs tmpfs): request=4 time=1.88 us 
4 KiB <<< . (tmpfs tmpfs): request=5 time=4.29 us 
4 KiB <<< . (tmpfs tmpfs): request=6 time=1.94 us 
4 KiB <<< . (tmpfs tmpfs): request=7 time=1.96 us (fast) 
4 KiB <<< . (tmpfs tmpfs): request=8 time=1.99 us (fast) 
4 KiB <<< . (tmpfs tmpfs): request=9 time=1.97 us (fast) 
4 KiB <<< . (tmpfs tmpfs): request=10 time=4.57 us (slow) 

--- . (tmpfs tmpfs) ioping statistics --- 
9 requests completed in 22.8 us, 36 KiB read, 395.3 k iops, 1.51 GiB/s 
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s 
min/avg/max/mdev = 1.88 us / 2.53 us / 4.57 us / 1.02 us


Nova disk: 

[root@cinder-test ~]# ioping -c 10 . 
4 KiB <<< . (xfs /dev/vda1): request=1 time=148.4 us (warmup) 
4 KiB <<< . (xfs /dev/vda1): request=2 time=200.4 us 
4 KiB <<< . (xfs /dev/vda1): request=3 time=186.5 us 
4 KiB <<< . (xfs /dev/vda1): request=4 time=186.9 us 
4 KiB <<< . (xfs /dev/vda1): request=5 time=184.7 us 
4 KiB <<< . (xfs /dev/vda1): request=6 time=191.6 us 
4 KiB <<< . (xfs /dev/vda1): request=7 time=184.5 us (fast) 
4 KiB <<< . (xfs /dev/vda1): request=8 time=193.1 us 
4 KiB <<< . (xfs /dev/vda1): request=9 time=184.9 us (fast) 
4 KiB <<< . (xfs /dev/vda1): request=10 time=188.7 us 

--- . (xfs /dev/vda1) ioping statistics --- 
9 requests completed in 1.70 ms, 36 KiB read, 5.29 k iops, 20.7 MiB/s 
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s 
min/avg/max/mdev = 184.5 us / 189.0 us / 200.4 us / 4.93 us


Cinder disk: 

[root@cinder-test mnt]# ioping -c 10 .                   
4 KiB <<< . (xfs /dev/vdb): request=1 time=688.8 us (warmup) 
4 KiB <<< . (xfs /dev/vdb): request=2 time=1.01 ms 
4 KiB <<< . (xfs /dev/vdb): request=3 time=867.7 us 
4 KiB <<< . (xfs /dev/vdb): request=4 time=950.4 us 
4 KiB <<< . (xfs /dev/vdb): request=5 time=1.25 ms 
4 KiB <<< . (xfs /dev/vdb): request=6 time=951.6 us 
4 KiB <<< . (xfs /dev/vdb): request=7 time=775.1 us (fast) 
4 KiB <<< . (xfs /dev/vdb): request=8 time=776.3 us (fast) 
4 KiB <<< . (xfs /dev/vdb): request=9 time=872.2 us 
4 KiB <<< . (xfs /dev/vdb): request=10 time=734.2 us (fast) 

--- . (xfs /dev/vdb) ioping statistics --- 
9 requests completed in 8.19 ms, 36 KiB read, 1.10 k iops, 4.29 MiB/s 
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s 
min/avg/max/mdev = 734.2 us / 910.1 us / 1.25 ms / 148.8 us

> 
> (3) Use `fio` with 'direct=1' (which uses non-bufferred I/O) is a good
>     sanity check if you want to get a second opinion regarding latency.
>     Here's a config and how to run it (after installing the `fio` RPM):
> 
>     Prepare this config file:
>         $ cat rand-write.fio
>         [writes]
>         rw=randwrite
>         direct=1
>         ioengine=libaio
>         bs=512
>         filename=/export/fio-rand-write
>         time_based
>         runtime=30
>         size=4K
> 
>     And run it as:
>         $ fio rand-write.fio


Disk mounts: 

[root@cinder-test ~]# df -h 
Filesystem      Size  Used Avail Use% Mounted on 
/dev/vda1        40G  1.3G   39G   4% /  <-----------------------------Nova Disk                                 
devtmpfs        1.9G     0  1.9G   0% /dev 
tmpfs           1.9G     0  1.9G   0% /dev/shm 
tmpfs           1.9G   17M  1.9G   1% /run 
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup 
/dev/vdb         40G   33M   40G   1% /mnt <---------------------------Cinder Disk
tmpfs           380M     0  380M   0% /run/user/1000 
tmpfs           380M     0  380M   0% /run/user/0



Fio conf file in /root/ (nova disk): 

[root@cinder-test ~]# cat rand-write.fio  
[writes] 
rw=randwrite 
direct=1 
ioengine=libaio 
bs=512 
filename=/root/fio-rand-write 
time_based 
runtime=30 
size=4K


Fio conf file in /mnt/ (cinder disk)

[root@cinder-test ~]# cat /mnt/rand-write.fio  
[writes] 
rw=randwrite 
direct=1 
ioengine=libaio 
bs=512 
filename=/mnt/fio-rand-write 
time_based 
runtime=30 
size=4K

------------------------ 

Results, combined with `virsh domblkstat` output for the instance (on compute node 1 before migration).

Pre-migration fio results -- cinder: 

[root@cinder-test mnt]# fio rand-write.fio 
writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/13222KB/0KB /s] [0/26.5K/0 iops] [eta 00m:00s]
writes: (groupid=0, jobs=1): err= 0: pid=9357: Wed Jun 28 07:00:18 2017
  write: io=145385KB, bw=4848.3KB/s, iops=9696, runt= 29987msec
    slat (usec): min=3, max=605, avg= 6.53, stdev= 3.77
    clat (usec): min=20, max=41570, avg=94.66, stdev=341.63
     lat (usec): min=28, max=41586, avg=101.40, stdev=344.16
    clat percentiles (usec):
     |  1.00th=[   27],  5.00th=[   27], 10.00th=[   28], 20.00th=[   30],
     | 30.00th=[   30], 40.00th=[   30], 50.00th=[   30], 60.00th=[   31],
     | 70.00th=[   32], 80.00th=[   33], 90.00th=[   34], 95.00th=[   39],
     | 99.00th=[ 1848], 99.50th=[ 1976], 99.90th=[ 2192], 99.95th=[ 2256],
     | 99.99th=[ 2480]
    lat (usec) : 50=96.00%, 100=0.18%, 250=0.02%, 500=0.02%, 750=0.01%
    lat (usec) : 1000=0.01%
    lat (msec) : 2=3.36%, 4=0.41%, 10=0.01%, 50=0.01%
  cpu          : usr=1.88%, sys=8.58%, ctx=290778, majf=0, minf=32
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=290770/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=145385KB, aggrb=4848KB/s, minb=4848KB/s, maxb=4848KB/s, mint=29987msec, maxt=29987msec

Disk stats (read/write):
  vdb: ios=0/288894, merge=0/0, ticks=0/26948, in_queue=26916, util=89.70% 

------------- 

Pre-migration -- `virsh domblkstat` run at random intervals while the fio benchmark was running on the Cinder disk:

------------------ 

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2179025
 wr_bytes 1142330368
 flush_operations 58
 rd_total_times 5608454723
 wr_total_times 57818127850
 flush_total_times 273423193

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2181619
 wr_bytes 1143658496
 flush_operations 58
 rd_total_times 5608454723
 wr_total_times 62239063054
 flush_total_times 273423193

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2183176
 wr_bytes 1144455680
 flush_operations 58
 rd_total_times 5608454723
 wr_total_times 64742764011
 flush_total_times 273423193

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2184700
 wr_bytes 1145235968
 flush_operations 58
 rd_total_times 5608454723
 wr_total_times 67016778336
 flush_total_times 273423193

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2185951
 wr_bytes 1145876480
 flush_operations 58
 rd_total_times 5608454723
 wr_total_times 68892866636
 flush_total_times 273423193

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2187604
 wr_bytes 1146722816
 flush_operations 58
 rd_total_times 5608454723
 wr_total_times 71698739790
 flush_total_times 273423193

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2217610
 wr_bytes 1162120192
 flush_operations 62
 rd_total_times 5608454723
 wr_total_times 73398293893
 flush_total_times 280723651

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2293585
 wr_bytes 1201019392
 flush_operations 62
 rd_total_times 5608454723
 wr_total_times 75041029644
 flush_total_times 280723651

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9503
 rd_bytes 270524928
 wr_req 2390092
 wr_bytes 1250430976
 flush_operations 62
 rd_total_times 5608454723
 wr_total_times 77226013060
 flush_total_times 280723651

=============== 

Pre-migration fio results -- nova: 
-------------- 
[root@cinder-test ~]# fio rand-write.fio 
writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/11914KB/0KB /s] [0/23.9K/0 iops] [eta 00m:00s]
writes: (groupid=0, jobs=1): err= 0: pid=9312: Wed Jun 28 06:58:11 2017
  write: io=367299KB, bw=12256KB/s, iops=24512, runt= 29968msec
    slat (usec): min=3, max=271, avg= 5.38, stdev= 1.88
    clat (usec): min=13, max=2294, avg=33.81, stdev=10.00
     lat (usec): min=28, max=2300, avg=39.37, stdev=10.22
    clat percentiles (usec):
     |  1.00th=[   29],  5.00th=[   30], 10.00th=[   30], 20.00th=[   30],
     | 30.00th=[   31], 40.00th=[   32], 50.00th=[   33], 60.00th=[   33],
     | 70.00th=[   34], 80.00th=[   36], 90.00th=[   39], 95.00th=[   43],
     | 99.00th=[   49], 99.50th=[   54], 99.90th=[   69], 99.95th=[   94],
     | 99.99th=[  278]
    lat (usec) : 20=0.01%, 50=99.08%, 100=0.87%, 250=0.02%, 500=0.02%
    lat (usec) : 750=0.01%, 1000=0.01%
    lat (msec) : 2=0.01%, 4=0.01%
  cpu          : usr=3.25%, sys=18.25%, ctx=734602, majf=0, minf=31
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=734597/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=367298KB, aggrb=12256KB/s, minb=12256KB/s, maxb=12256KB/s, mint=29968msec, maxt=29968msec

Disk stats (read/write):
  vda: ios=0/732575, merge=0/0, ticks=0/23982, in_queue=23901, util=79.66%

---------

Data collected while running the fio tests: 

Pre-migration -- `virsh domblkstat` run at random intervals while the fio benchmark was running on the Nova root disk:

-------------- 
[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 1618880
 wr_bytes 855159808
 flush_operations 40
 rd_total_times 5608198928
 wr_total_times 41750684997
 flush_total_times 224341403

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 1701879
 wr_bytes 897655296
 flush_operations 40
 rd_total_times 5608198928
 wr_total_times 43544355787
 flush_total_times 224341403

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 1798892
 wr_bytes 947325952
 flush_operations 40
 rd_total_times 5608198928
 wr_total_times 45711246318
 flush_total_times 224341403

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 1888475
 wr_bytes 993192448
 flush_operations 40
 rd_total_times 5608198928
 wr_total_times 47692698583
 flush_total_times 224341403

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 1975088
 wr_bytes 1037538304
 flush_operations 40
 rd_total_times 5608198928
 wr_total_times 49570697933
 flush_total_times 224341403

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 2030490
 wr_bytes 1065904128
 flush_operations 40
 rd_total_times 5608198928
 wr_total_times 51176172805
 flush_total_times 224341403

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 2092205
 wr_bytes 1097502720
 flush_operations 42
 rd_total_times 5608198928
 wr_total_times 52648933062
 flush_total_times 229713279

[root@myr-gs-sg-ocpn-01 ~]# virsh domblkstat instance-0000008c
 rd_req 9502
 rd_bytes 270520832
 wr_req 2137502
 wr_bytes 1120694784
 flush_operations 42
 rd_total_times 5608198928
 wr_total_times 53629498113
 flush_total_times 229713279

--------------------- 

POST MIGRATION: 

Instance has moved to compute node 2 (from compute node 1)
_________________

Post-live-migration Cinder fio benchmark results (note the drastic drop in IOPS between pre- and post-migration: from ~9.7 K to ~590, more than 16 times):

[root@cinder-test mnt]# fio rand-write.fio 
writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/310KB/0KB /s] [0/620/0 iops] [eta 00m:00s]
writes: (groupid=0, jobs=1): err= 0: pid=9378: Wed Jun 28 07:09:08 2017
  write: io=8807.6KB, bw=300629B/s, iops=587, runt= 30000msec
    slat (usec): min=8, max=252, avg=17.54, stdev= 5.43
    clat (usec): min=1062, max=41568, avg=1680.09, stdev=870.98
     lat (usec): min=1079, max=41585, avg=1698.03, stdev=871.16
    clat percentiles (usec):
     |  1.00th=[ 1256],  5.00th=[ 1368], 10.00th=[ 1432], 20.00th=[ 1496],
     | 30.00th=[ 1544], 40.00th=[ 1592], 50.00th=[ 1640], 60.00th=[ 1704],
     | 70.00th=[ 1752], 80.00th=[ 1816], 90.00th=[ 1912], 95.00th=[ 1992],
     | 99.00th=[ 2160], 99.50th=[ 2256], 99.90th=[ 3920], 99.95th=[10688],
     | 99.99th=[41728]
    lat (msec) : 2=95.12%, 4=4.79%, 10=0.04%, 20=0.01%, 50=0.05%
  cpu          : usr=0.46%, sys=1.50%, ctx=17618, majf=0, minf=32
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=17615/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=8807KB, aggrb=293KB/s, minb=293KB/s, maxb=293KB/s, mint=30000msec, maxt=30000msec

Disk stats (read/write):
  vdb: ios=0/17568, merge=0/0, ticks=0/29436, in_queue=29429, util=98.09%
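For reference, the job file rand-write.fio was not included in the report; a job file consistent with the parameters echoed in the fio output above (randwrite, 512-byte blocks, libaio, queue depth 1, ~30 s runtime) would look roughly like the following sketch -- the filename and size here are placeholders, not taken from the original:

```ini
; Hypothetical reconstruction of rand-write.fio (the actual job file was
; not attached); parameters inferred from the fio output above.
[writes]
rw=randwrite
bs=512
ioengine=libaio
iodepth=1
direct=1
runtime=30
time_based
filename=/mnt/fio-testfile
size=1g
```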
------------------ 
Post-migration Cinder -- command executed at random intervals during the fio benchmark:
---------------------- 
[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 13008
 wr_bytes 6668800
 flush_operations 4
 rd_total_times 0
 wr_total_times 21240709235
 flush_total_times 8675859

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 14308
 wr_bytes 7334400
 flush_operations 4
 rd_total_times 0
 wr_total_times 23404807773
 flush_total_times 8675859

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 17621
 wr_bytes 9032192
 flush_operations 8
 rd_total_times 0
 wr_total_times 28647256606
 flush_total_times 11206144

[The command was repeated seven more times; the output was identical to the sample above, i.e. the counters had stopped advancing.]
====================
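As a rough cross-check (this is a minimal sketch, not the latency-analysis script attached to this bug), the cumulative domblkstat counters can be turned into an average per-request latency: wr_total_times is reported in nanoseconds, so the delta between two samples divided by the delta in wr_req gives the mean write latency over that interval. Using the first and third post-migration Cinder samples above:

```python
# Sketch: estimate average per-request write latency from two cumulative
# `virsh domblkstat` samples. wr_total_times is in nanoseconds.

def avg_write_latency_ms(before, after):
    """Mean write latency (ms) over the interval between two samples."""
    dreq = after["wr_req"] - before["wr_req"]
    dns = after["wr_total_times"] - before["wr_total_times"]
    if dreq <= 0:
        raise ValueError("no writes completed between samples")
    return dns / dreq / 1e6  # ns -> ms

# Values copied from the post-migration Cinder polls above:
before = {"wr_req": 13008, "wr_total_times": 21240709235}
after = {"wr_req": 17621, "wr_total_times": 28647256606}
print(round(avg_write_latency_ms(before, after), 2))  # ~1.61 ms
```

The ~1.61 ms figure is close to the ~1.68 ms average clat fio reports for the post-migration Cinder run, so the two measurements agree.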

Post-migration fio bench results on the nova disk (IOPS results captured are consistent between pre- and post-migration):
------------------------- 

[root@cinder-test ~]# fio rand-write.fio 
writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/12347KB/0KB /s] [0/24.7K/0 iops] [eta 00m:00s]
writes: (groupid=0, jobs=1): err= 0: pid=9383: Wed Jun 28 07:11:33 2017
  write: io=417958KB, bw=13948KB/s, iops=27896, runt= 29965msec
    slat (usec): min=2, max=270, avg= 5.08, stdev= 2.03
    clat (usec): min=10, max=8217, avg=29.08, stdev=19.96
     lat (usec): min=25, max=8222, avg=34.34, stdev=20.08
    clat percentiles (usec):
     |  1.00th=[   27],  5.00th=[   27], 10.00th=[   27], 20.00th=[   28],
     | 30.00th=[   28], 40.00th=[   28], 50.00th=[   28], 60.00th=[   28],
     | 70.00th=[   29], 80.00th=[   30], 90.00th=[   31], 95.00th=[   33],
     | 99.00th=[   43], 99.50th=[   45], 99.90th=[   55], 99.95th=[   73],
     | 99.99th=[  282]
    lat (usec) : 20=0.01%, 50=99.81%, 100=0.16%, 250=0.01%, 500=0.02%
    lat (usec) : 750=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=4.41%, sys=19.89%, ctx=835921, majf=0, minf=31
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=835916/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=417958KB, aggrb=13948KB/s, minb=13948KB/s, maxb=13948KB/s, mint=29965msec, maxt=29965msec

Disk stats (read/write):
  vda: ios=0/833772, merge=0/0, ticks=0/24135, in_queue=24065, util=80.20%
----------


Post-migration Nova -- command executed at random intervals during the fio benchmark:
------------------- 

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 368858
 wr_bytes 188915712
 flush_operations 18
 rd_total_times 0
 wr_total_times 36183512565
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 450204
 wr_bytes 230564864
 flush_operations 18
 rd_total_times 0
 wr_total_times 37912236976
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 490439
 wr_bytes 251165184
 flush_operations 18
 rd_total_times 0
 wr_total_times 38772367741
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 545902
 wr_bytes 279562240
 flush_operations 18
 rd_total_times 0
 wr_total_times 39971329161
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 606208
 wr_bytes 310438912
 flush_operations 18
 rd_total_times 0
 wr_total_times 41271424813
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 664185
 wr_bytes 340123136
 flush_operations 18
 rd_total_times 0
 wr_total_times 42512679727
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 753278
 wr_bytes 385738752
 flush_operations 18
 rd_total_times 0
 wr_total_times 44419059221
 flush_total_times 23622536

[root@myr-gs-sg-ocpn-02 ~]# virsh domblkstat instance-0000008c
 rd_req 0
 rd_bytes 0
 wr_req 810608
 wr_bytes 415091712
 flush_operations 18
 rd_total_times 0
 wr_total_times 45650939594
 flush_total_times 23622536


Best,
Atanu
Comment 10 Daniel Berrange 2017-06-29 05:38:33 EDT
Comment #9 is referring to a VM, instance-0000008c, which appears to have 2 disks.

The sosreport we've got only has instance-00000056, instance-00000059, and instance-0000005c, all of which only have 1 disk, which appears to be cinder backed.

To understand why the 2 disks in instance-0000008c have differing performance behaviour, we need to examine the QEMU configuration of this instance, to see what's different about the configuration of the cinder vs. nova disks in this setup. Can we get the /var/run/libvirt/qemu/instance-0000008c.xml file from the target host?
Comment 11 Anil Dhingra 2017-06-29 10:49:10 EDT
Please also find fresh sosreports on collabshell from compute node 1 (the node the VM was migrated from) and compute node 2. Note that I ran a few live migrations of instance 4c7a067d-6c06-49c7-a675-802730b00b90 between nodes one and two to capture the results; the final migration was from node 1 to node 2.
===================

Copying the xml file below. Also will attach the sosreports from Compute Node 1 & 2. The instance ID is 4c7a067d-6c06-49c7-a675-802730b00b90; and it was migrated from compute node 1 to 2 (in fact, a few migrations between one and two -- but the final one was from 1 to 2) while capturing the data yesterday. 

Best,
Atanu 

[root@myr-gs-sg-ocpn-02 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 20    instance-0000008c              running

[root@myr-gs-sg-ocpn-02 ~]# cat /var/run/libvirt/qemu/instance-0000008c.xml
<!--
WARNING: THIS IS AN AUTO-GENERATED FILE. CHANGES TO IT ARE LIKELY TO BE
OVERWRITTEN AND LOST. Changes to this xml configuration should be made using:
  virsh edit instance-0000008c
or other application using the libvirt API.
-->

<domstatus state='running' reason='booted' pid='424294'>
  <monitor path='/var/lib/libvirt/qemu/domain-20-instance-0000008c/monitor.sock' json='1' type='unix'/>
  <vcpus>
    <vcpu id='0' pid='424362'/>
    <vcpu id='1' pid='424363'/>
  </vcpus>
  <qemuCaps>
    <flag name='kvm'/>
    <flag name='mem-path'/>
    <flag name='drive-serial'/>
    <flag name='chardev'/>
    <flag name='monitor-json'/>
    <flag name='sdl'/>
    <flag name='netdev'/>
    <flag name='rtc'/>
    <flag name='vhost-net'/>
    <flag name='no-hpet'/>
    <flag name='no-kvm-pit'/>
    <flag name='pci-configfd'/>
    <flag name='nodefconfig'/>
    <flag name='boot-menu'/>
    <flag name='fsdev'/>
    <flag name='name-process'/>
    <flag name='smbios-type'/>
    <flag name='vga-qxl'/>
    <flag name='spice'/>
    <flag name='vga-none'/>
    <flag name='boot-index'/>
    <flag name='hda-duplex'/>
    <flag name='drive-aio'/>
    <flag name='pci-multibus'/>
    <flag name='pci-bootindex'/>
    <flag name='ccid-emulated'/>
    <flag name='ccid-passthru'/>
    <flag name='chardev-spicevmc'/>
    <flag name='virtio-tx-alg'/>
    <flag name='device-qxl-vga'/>
    <flag name='pci-multifunction'/>
    <flag name='virtio-blk-pci.ioeventfd'/>
    <flag name='sga'/>
    <flag name='virtio-blk-pci.event_idx'/>
    <flag name='virtio-net-pci.event_idx'/>
    <flag name='cache-directsync'/>
    <flag name='piix3-usb-uhci'/>
    <flag name='piix4-usb-uhci'/>
    <flag name='usb-ehci'/>
    <flag name='ich9-usb-ehci1'/>
    <flag name='vt82c686b-usb-uhci'/>
    <flag name='usb-redir'/>
    <flag name='usb-hub'/>
    <flag name='no-shutdown'/>
    <flag name='cache-unsafe'/>
    <flag name='ich9-ahci'/>
    <flag name='no-acpi'/>
    <flag name='fsdev-readonly'/>
    <flag name='virtio-blk-pci.scsi'/>
    <flag name='drive-copy-on-read'/>
    <flag name='fsdev-writeout'/>
    <flag name='drive-iotune'/>
    <flag name='system_wakeup'/>
    <flag name='scsi-disk.channel'/>
    <flag name='scsi-block'/>
    <flag name='transaction'/>
    <flag name='block-job-async'/>
    <flag name='scsi-cd'/>
    <flag name='ide-cd'/>
    <flag name='no-user-config'/>
    <flag name='hda-micro'/>
    <flag name='dump-guest-memory'/>
    <flag name='nec-usb-xhci'/>
    <flag name='balloon-event'/>
    <flag name='bridge'/>
    <flag name='virtio-scsi-pci'/>
    <flag name='blockio'/>
    <flag name='disable-s3'/>
    <flag name='disable-s4'/>
    <flag name='usb-redir.filter'/>
    <flag name='ide-drive.wwn'/>
    <flag name='scsi-disk.wwn'/>
    <flag name='seccomp-sandbox'/>
    <flag name='reboot-timeout'/>
    <flag name='dump-guest-core'/>
    <flag name='seamless-migration'/>
    <flag name='block-commit'/>
    <flag name='vnc'/>
    <flag name='drive-mirror'/>
    <flag name='usb-redir.bootindex'/>
    <flag name='usb-host.bootindex'/>
    <flag name='blockdev-snapshot-sync'/>
    <flag name='qxl'/>
    <flag name='VGA'/>
    <flag name='cirrus-vga'/>
    <flag name='device-video-primary'/>
    <flag name='add-fd'/>
    <flag name='nbd-server'/>
    <flag name='virtio-rng'/>
    <flag name='rng-random'/>
    <flag name='rng-egd'/>
    <flag name='dtb'/>
    <flag name='ipv6-migration'/>
    <flag name='machine-opt'/>
    <flag name='machine-usb-opt'/>
    <flag name='pci-bridge'/>
    <flag name='vfio-pci'/>
    <flag name='vfio-pci.bootindex'/>
    <flag name='scsi-generic'/>
    <flag name='scsi-generic.bootindex'/>
    <flag name='mem-merge'/>
    <flag name='vnc-websocket'/>
    <flag name='drive-discard'/>
    <flag name='mlock'/>
    <flag name='vnc-share-policy'/>
    <flag name='device-del-event'/>
    <flag name='dmi-to-pci-bridge'/>
    <flag name='i440fx-pci-hole64-size'/>
    <flag name='q35-pci-hole64-size'/>
    <flag name='usb-storage'/>
    <flag name='usb-storage.removable'/>
    <flag name='virtio-mmio'/>
    <flag name='ich9-intel-hda'/>
    <flag name='kvm-pit-lost-tick-policy'/>
    <flag name='boot-strict'/>
    <flag name='pvpanic'/>
    <flag name='spice-file-xfer-disable'/>
    <flag name='spiceport'/>
    <flag name='usb-kbd'/>
    <flag name='host-pci-multidomain'/>
    <flag name='msg-timestamp'/>
    <flag name='active-commit'/>
    <flag name='change-backing-file'/>
    <flag name='memory-backend-ram'/>
    <flag name='numa'/>
    <flag name='memory-backend-file'/>
    <flag name='rtc-reset-reinjection'/>
    <flag name='splash-timeout'/>
    <flag name='iothread'/>
    <flag name='migrate-rdma'/>
    <flag name='ivshmem'/>
    <flag name='drive-iotune-max'/>
    <flag name='VGA.vgamem_mb'/>
    <flag name='qxl.vgamem_mb'/>
    <flag name='qxl-vga.vgamem_mb'/>
    <flag name='pc-dimm'/>
    <flag name='machine-vmport-opt'/>
    <flag name='aes-key-wrap'/>
    <flag name='dea-key-wrap'/>
    <flag name='pci-serial'/>
    <flag name='vhost-user-multiqueue'/>
    <flag name='migration-event'/>
    <flag name='ioh3420'/>
    <flag name='x3130-upstream'/>
    <flag name='xio3130-downstream'/>
    <flag name='rtl8139'/>
    <flag name='e1000'/>
    <flag name='virtio-net'/>
    <flag name='gic-version'/>
    <flag name='incoming-defer'/>
    <flag name='virtio-gpu'/>
    <flag name='virtio-keyboard'/>
    <flag name='virtio-mouse'/>
    <flag name='virtio-tablet'/>
    <flag name='virtio-input-host'/>
    <flag name='chardev-file-append'/>
    <flag name='ich9-disable-s3'/>
    <flag name='ich9-disable-s4'/>
    <flag name='vserport-change-event'/>
    <flag name='virtio-balloon-pci.deflate-on-oom'/>
    <flag name='qxl.vram64_size_mb'/>
    <flag name='qxl-vga.vram64_size_mb'/>
    <flag name='chardev-logfile'/>
    <flag name='debug-threads'/>
    <flag name='secret'/>
    <flag name='pxb'/>
    <flag name='pxb-pcie'/>
    <flag name='device-tray-moved-event'/>
    <flag name='nec-usb-xhci-ports'/>
    <flag name='virtio-scsi-pci.iothread'/>
    <flag name='name-guest'/>
    <flag name='drive-detect-zeroes'/>
    <flag name='tls-creds-x509'/>
    <flag name='intel-iommu'/>
    <flag name='smm'/>
    <flag name='query-hotpluggable-cpus'/>
    <flag name='virtio-net.rx_queue_size'/>
    <flag name='ivshmem-plain'/>
    <flag name='ivshmem-doorbell'/>
  </qemuCaps>
  <devices>
    <device alias='virtio-disk1'/>
    <device alias='virtio-disk0'/>
    <device alias='video0'/>
    <device alias='serial1'/>
    <device alias='serial0'/>
    <device alias='balloon0'/>
    <device alias='net0'/>
    <device alias='input0'/>
    <device alias='usb'/>
  </devices>
  <libDir path='/var/lib/libvirt/qemu/domain-20-instance-0000008c'/>
  <channelTargetDir path='/var/lib/libvirt/qemu/channel/target/domain-20-instance-0000008c'/>
  <domain type='kvm' id='20'>
    <name>instance-0000008c</name>
    <uuid>4c7a067d-6c06-49c7-a675-802730b00b90</uuid>
    <metadata>
      <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
        <nova:package version="14.0.3-9.el7ost"/>
        <nova:name>cinder-test</nova:name>
        <nova:creationTime>2017-06-28 14:05:28</nova:creationTime>
        <nova:flavor name="m1.medium">
          <nova:memory>4096</nova:memory>
          <nova:disk>40</nova:disk>
          <nova:swap>0</nova:swap>
          <nova:ephemeral>0</nova:ephemeral>
          <nova:vcpus>2</nova:vcpus>
        </nova:flavor>
        <nova:owner>
          <nova:user uuid="aa14d599c8694e1b859c3ac0bdf3d499">admin</nova:user>
          <nova:project uuid="4726e3735d18445d9f0816fc7849e33e">admin</nova:project>
        </nova:owner>
        <nova:root type="image" uuid="de627974-d900-4b71-bdd6-5b1138012422"/>
      </nova:instance>
    </metadata>
    <memory unit='KiB'>4194304</memory>
    <currentMemory unit='KiB'>4194304</currentMemory>
    <vcpu placement='static'>2</vcpu>
    <cputune>
      <shares>2048</shares>
    </cputune>
    <resource>
      <partition>/machine</partition>
    </resource>
    <sysinfo type='smbios'>
      <system>
        <entry name='manufacturer'>Red Hat</entry>
        <entry name='product'>OpenStack Compute</entry>
        <entry name='version'>14.0.3-9.el7ost</entry>
        <entry name='serial'>f6154140-1f78-4bb9-8c52-f3c7ba5bedd3</entry>
        <entry name='uuid'>4c7a067d-6c06-49c7-a675-802730b00b90</entry>
        <entry name='family'>Virtual Machine</entry>
      </system>
    </sysinfo>
    <os>
      <type arch='x86_64' machine='pc-i440fx-rhel7.3.0'>hvm</type>
      <boot dev='hd'/>
      <smbios mode='sysinfo'/>
    </os>
    <features>
      <acpi/>
      <apic/>
    </features>
    <cpu mode='host-model'>
      <model fallback='allow'/>
      <topology sockets='2' cores='1' threads='1'/>
    </cpu>
    <clock offset='utc'>
      <timer name='pit' tickpolicy='delay'/>
      <timer name='rtc' tickpolicy='catchup'/>
      <timer name='hpet' present='no'/>
    </clock>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <devices>
      <emulator>/usr/libexec/qemu-kvm</emulator>
      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
        <auth username='openstack'>
          <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/>
        </auth>
        <source protocol='rbd' name='vms/4c7a067d-6c06-49c7-a675-802730b00b90_disk'>
          <host name='192.168.23.12' port='6789'/>
          <host name='192.168.23.13' port='6789'/>
          <host name='192.168.23.14' port='6789'/>
        </source>
        <backingStore/>
        <target dev='vda' bus='virtio'/>
        <alias name='virtio-disk0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
      </disk>
      <disk type='network' device='disk'>
        <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
        <auth username='openstack'>
          <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/>
        </auth>
        <source protocol='rbd' name='volumes/volume-d485683d-5581-4ce9-8e7c-6bde9d285d86'>
          <host name='192.168.23.12' port='6789'/>
          <host name='192.168.23.13' port='6789'/>
          <host name='192.168.23.14' port='6789'/>
        </source>
        <backingStore/>
        <target dev='vdb' bus='virtio'/>
        <serial>d485683d-5581-4ce9-8e7c-6bde9d285d86</serial>
        <alias name='virtio-disk1'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </disk>
      <controller type='usb' index='0'>
        <alias name='usb'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
      </controller>
      <controller type='pci' index='0' model='pci-root'>
        <alias name='pci.0'/>
      </controller>
      <interface type='bridge'>
        <mac address='fa:16:3e:bb:d6:6b'/>
        <source bridge='qbr34fccbd9-7e'/>
        <target dev='tap34fccbd9-7e'/>
        <model type='virtio'/>
        <alias name='net0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
      </interface>
      <serial type='file'>
        <source path='/var/lib/nova/instances/4c7a067d-6c06-49c7-a675-802730b00b90/console.log'/>
        <target port='0'/>
        <alias name='serial0'/>
      </serial>
      <serial type='pty'>
        <source path='/dev/pts/0'/>
        <target port='1'/>
        <alias name='serial1'/>
      </serial>
      <console type='file'>
        <source path='/var/lib/nova/instances/4c7a067d-6c06-49c7-a675-802730b00b90/console.log'/>
        <target type='serial' port='0'/>
        <alias name='serial0'/>
      </console>
      <input type='tablet' bus='usb'>
        <alias name='input0'/>
        <address type='usb' bus='0' port='1'/>
      </input>
      <input type='mouse' bus='ps2'>
        <alias name='input1'/>
      </input>
      <input type='keyboard' bus='ps2'>
        <alias name='input2'/>
      </input>
      <graphics type='vnc' port='5900' autoport='yes' listen='0.0.0.0' keymap='en-us'>
        <listen type='address' address='0.0.0.0' fromConfig='0'/>
      </graphics>
      <video>
        <model type='cirrus' vram='16384' heads='1' primary='yes'/>
        <alias name='video0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </video>
      <memballoon model='virtio'>
        <stats period='10'/>
        <alias name='balloon0'/>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
      </memballoon>
    </devices>
    <seclabel type='dynamic' model='selinux' relabel='yes'>
      <label>system_u:system_r:svirt_t:s0:c373,c579</label>
      <imagelabel>system_u:object_r:svirt_image_t:s0:c373,c579</imagelabel>
    </seclabel>
    <seclabel type='dynamic' model='dac' relabel='yes'>
      <label>+107:+107</label>
      <imagelabel>+107:+107</imagelabel>
    </seclabel>
  </domain>
</domstatus>
[root@myr-gs-sg-ocpn-02 ~]#
Comment 12 Daniel Berrange 2017-06-29 11:08:52 EDT
(In reply to Anil Dhingra from comment #11)
> Please find below also fresh sosreport on collabshell from both compute node
> 1  where the VM was migrated from (node that I ran a few live migrations of
> instance (4c7a067d-6c06-49c7-a675-802730b00b90) between one and two to
> capture the results.... the final migration was from node 1 to node 2
> ===================


>       <disk type='network' device='disk'>
>         <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
>         <auth username='openstack'>
>           <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/>
>         </auth>
>         <source protocol='rbd'
> name='vms/4c7a067d-6c06-49c7-a675-802730b00b90_disk'>
>           <host name='192.168.23.12' port='6789'/>
>           <host name='192.168.23.13' port='6789'/>
>           <host name='192.168.23.14' port='6789'/>
>         </source>
>         <backingStore/>
>         <target dev='vda' bus='virtio'/>
>         <alias name='virtio-disk0'/>
>         <address type='pci' domain='0x0000' bus='0x00' slot='0x04'
> function='0x0'/>
>       </disk>
>       <disk type='network' device='disk'>
>         <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
>         <auth username='openstack'>
>           <secret type='ceph' uuid='aff97496-b97b-438a-935a-6ea72a334735'/>
>         </auth>
>         <source protocol='rbd'
> name='volumes/volume-d485683d-5581-4ce9-8e7c-6bde9d285d86'>
>           <host name='192.168.23.12' port='6789'/>
>           <host name='192.168.23.13' port='6789'/>
>           <host name='192.168.23.14' port='6789'/>
>         </source>
>         <backingStore/>
>         <target dev='vdb' bus='virtio'/>
>         <serial>d485683d-5581-4ce9-8e7c-6bde9d285d86</serial>
>         <alias name='virtio-disk1'/>
>         <address type='pci' domain='0x0000' bus='0x00' slot='0x05'
> function='0x0'/>
>       </disk>

Ok, both the Nova and Cinder disks here are pointing to the exact same RBD server, and have identical settings. The only potentially significant difference is that one disk is in the "vms" pool and one in the "volumes" pool. So I/O is potentially serviced by different underlying storage in the RBD server.

It is hard to see why use of a different storage pool would have an effect on performance only after migration.

It is also hard to see why a QEMU bug would only affect 1 disk after migration though, given the identical configs.
Comment 13 Anil Dhingra 2017-06-29 22:57:59 EDT
another test
=========

Instead of rebooting, if I unmount the cinder volume, detach the disk from the instance, and reattach it (all of this on the live-migrated server where performance has already degraded, on the cinder disk only, without rebooting the server), the latency is back to normal after reattachment. That's why I feel it's related to Cinder only, and that something after live migration triggers this oddity.

References below: 

1. ioping results after live migration (fio already shared in my last comment):

[root@cinder-test ~]# cd /mnt/
[root@cinder-test mnt]# ioping -c 10 .
4 KiB <<< . (xfs /dev/vdb): request=1 time=833.4 us (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=1.30 ms
4 KiB <<< . (xfs /dev/vdb): request=3 time=1.17 ms
4 KiB <<< . (xfs /dev/vdb): request=4 time=997.2 us
4 KiB <<< . (xfs /dev/vdb): request=5 time=965.0 us
4 KiB <<< . (xfs /dev/vdb): request=6 time=966.3 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=997.2 us
^C
--- . (xfs /dev/vdb) ioping statistics ---
6 requests completed in 6.40 ms, 24 KiB read, 937 iops, 3.66 MiB/s
generated 7 requests in 6.40 s, 28 KiB, 1 iops, 4.38 KiB/s
min/avg/max/mdev = 965.0 us / 1.07 ms / 1.30 ms / 125.9 us

2. Unmount the disk:

[root@cinder-test mnt]# cd
[root@cinder-test ~]# umount /mnt/
[root@cinder-test ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G  1.3G   39G   4% /
devtmpfs        1.9G     0  1.9G   0% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G   17M  1.9G   1% /run
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
tmpfs           380M     0  380M   0% /run/user/1000

3. Detach the vol from Horizon and confirm on VM:

[root@cinder-test ~]# cat /proc/partitions
major minor  #blocks  name

 253        0   41943040 vda
 253        1   41935052 vda1

4. Reattach the vol from Horizon and re-confirm availability:

[root@cinder-test ~]#  cat /proc/partitions
major minor  #blocks  name

 253        0   41943040 vda
 253        1   41935052 vda1
 253       16   41943040 vdb

5. Mount the disk at /mnt/ again:  

[root@cinder-test ~]# mount -a
[root@cinder-test ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        40G  1.3G   39G   4% /
devtmpfs        1.9G     0  1.9G   0% /dev
tmpfs           1.9G     0  1.9G   0% /dev/shm
tmpfs           1.9G   17M  1.9G   1% /run
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
tmpfs           380M     0  380M   0% /run/user/1000
/dev/vdb         40G   33M   40G   1% /mnt <---------------------- cinder disk

6. Run ioping test: 
                                                                                                                                                                                                  
[root@cinder-test ~]# cd /mnt/                                                                                                                                                                                                               
[root@cinder-test mnt]# ioping -c 10 .                                                                                                                                                                                                      
4 KiB <<< . (xfs /dev/vdb): request=1 time=124.3 us (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=311.7 us
4 KiB <<< . (xfs /dev/vdb): request=3 time=211.6 us
4 KiB <<< . (xfs /dev/vdb): request=4 time=251.5 us
4 KiB <<< . (xfs /dev/vdb): request=5 time=180.3 us
4 KiB <<< . (xfs /dev/vdb): request=6 time=179.7 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=206.1 us
4 KiB <<< . (xfs /dev/vdb): request=8 time=266.7 us
4 KiB <<< . (xfs /dev/vdb): request=9 time=301.3 us
4 KiB <<< . (xfs /dev/vdb): request=10 time=182.5 us (fast)

--- . (xfs /dev/vdb) ioping statistics ---
9 requests completed in 2.09 ms, 36 KiB read, 4.30 k iops, 16.8 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 179.7 us / 232.4 us / 311.7 us / 49.1 us

7. Run fio test: 

[root@cinder-test mnt]# fio rand-write.fio 
writes: (g=0): rw=randwrite, bs=512-512/512-512/512-512, ioengine=libaio, iodepth=1
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/12928KB/0KB /s] [0/25.9K/0 iops] [eta 00m:00s]
writes: (groupid=0, jobs=1): err= 0: pid=18981: Wed Jun 28 09:49:44 2017
  write: io=389380KB, bw=12994KB/s, iops=25987, runt= 29967msec
    slat (usec): min=4, max=319, avg= 5.98, stdev= 1.89
    clat (usec): min=10, max=7595, avg=30.79, stdev=15.24
     lat (usec): min=33, max=7600, avg=36.98, stdev=15.37
    clat percentiles (usec):
     |  1.00th=[   29],  5.00th=[   29], 10.00th=[   30], 20.00th=[   30],
     | 30.00th=[   30], 40.00th=[   30], 50.00th=[   30], 60.00th=[   31],
     | 70.00th=[   31], 80.00th=[   31], 90.00th=[   31], 95.00th=[   33],
     | 99.00th=[   38], 99.50th=[   41], 99.90th=[   58], 99.95th=[   67],
     | 99.99th=[  262]
    lat (usec) : 20=0.01%, 50=99.81%, 100=0.17%, 250=0.01%, 500=0.01%
    lat (usec) : 750=0.01%
    lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%
  cpu          : usr=3.53%, sys=21.17%, ctx=778768, majf=0, minf=31
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=778760/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=389380KB, aggrb=12993KB/s, minb=12993KB/s, maxb=12993KB/s, mint=29967msec, maxt=29967msec

Disk stats (read/write):
  vdb: ios=0/776431, merge=0/3, ticks=0/22695, in_queue=22601, util=75.33%
Comment 14 Gorka Eguileor 2017-06-30 04:59:52 EDT
In comment #3 I mentioned: "With the RBD backend the Cinder service has no effect on the live migration, since both the attach/detach and initialize/terminate connection are basically NOOPs."

To make it clear for everyone, here is the actual code:

Attach and detach are not defined in the RBD driver, so the code used is from the base class:

    def attach_volume(self, context, volume, instance_uuid, host_name,
                      mountpoint):
        """Callback for volume attached to instance or host."""
        pass

    def detach_volume(self, context, volume, attachment=None):
        """Callback for volume detached."""
        pass

Then the code to create and remove the exports does nothing as well:
    def create_export(self, context, volume, connector):
        """Exports the volume."""
        pass

    def remove_export(self, context, volume):
        """Removes an export for a logical volume."""
        pass

And finally, the initialize and terminate connection code, which basically dumps the monitors' addresses and passes them along with the configuration:

    def initialize_connection(self, volume, connector):
        hosts, ports = self._get_mon_addrs()
        data = {
            'driver_volume_type': 'rbd',
            'data': {
                'name': '%s/%s' % (self.configuration.rbd_pool,
                                   volume.name),
                'hosts': hosts,
                'ports': ports,
                'cluster_name': self.configuration.rbd_cluster_name,
                'auth_enabled': (self.configuration.rbd_user is not None),
                'auth_username': self.configuration.rbd_user,
                'secret_type': 'ceph',
                'secret_uuid': self.configuration.rbd_secret_uuid,
                'volume_id': volume.id,
                "discard": True,
            }
        }
        LOG.debug('connection data: %s', data)
        return data

    def terminate_connection(self, volume, connector, **kwargs):
        pass

As you can see, there's not much that Cinder is doing there. Sure, Cinder could be reporting different monitors, or the same monitors in a different order, but I checked the "hosts" and "ports" in the connection information that Cinder returned in the logs I had here from the other day, and they are the same in all entries:

  grep -RnI '"driver_volume_type": "rbd"' */var/log/nova/*.log

Which gives us:

 hosts: ["192.168.23.12", "192.168.23.13", "192.168.23.14"]
 ports: ["6789", "6789", "6789"]

So I reiterate my original assessment: this is not a Cinder problem.
Comment 15 Kashyap Chamarthy 2017-07-03 09:11:02 EDT
Thanks, Gorka for the Cinder-level analysis.


Meanwhile, Stefan Hajnoczi (a QEMU I/O expert) looked at the `fio` 
results, and he writes on IRC:

    "The `fio` results [from
    https://bugzilla.redhat.com/show_bug.cgi?id=1463897#c9] show that
    the Cinder disk is affected much more by tail latencies (99th
    percentile, etc).  When you compare Nova vs. Cinder before migration
    they have similar 50th percentile latency.  Post-migration the
    Cinder disk is suddenly experiencing ~98th percentile latency *all
    the time*.  So it becomes super-slow."

So, the next suggested step is to trace 'librbd' write submission and
request completion, e.g. with `perf` probes:

    perf probe -x /usr/lib64/librbd.so rbd_aio_write
    perf probe -x /usr/lib64/librbd.so rbd_aio_read  

This requires:

(1) A kernel compiled with:

    CONFIG_UPROBES=y
    CONFIG_UPROBE_EVENTS=y

(2) Debug RPMs installed for librbd.so.

Jason Dillaman, from the Ceph project, also suggested (over e-mail)
performing some performance testing with the 'librbd' cache disabled,
perhaps with something like the following in ceph.conf:

    [client]
    rbd cache = false

Another approach Jason suggested, as an alternative to `perf`, is the
following:

    You can just use LTTng-UST directly with a standard kernel by adding                                                                                                                    
    "rbd tracing = true" in your ceph.conf [1] and use "rbd-replay-prep                                                                                                                       
    --verbose" to dump all the traced events. 

    [1] http://docs.ceph.com/docs/master/rbd/rbd-replay/
Comment 16 Stefan Hajnoczi 2017-07-06 11:11:32 EDT
(In reply to Kashyap Chamarthy from comment #15)
> Thanks, Gorka for the Cinder-level analysis.
> 
> 
> Meanwhile, Stefan Hajnoczi (QEMU I/O)experts looked at the `fio` 
> results, and he writes on IRC:
> 
>     "The `fio` results [from
>     https://bugzilla.redhat.com/show_bug.cgi?id=1463897#c9] show that
>     the Cinder disk is affected much more by tail latencies (99th
>     percentile, etc).  When you compare Nova vs. Cinder before migration
>     they have similar 50th percentile latency.  Post-migration the
>     Cinder disk is suddenly experiencing ~98th percentile latency *all
>     the time*.  So it becomes super-slow."
> 
> So, the next suggested step is tracing 'librbd' write and request 
> completion, and try either with `perf` probes:
> 
>     perf probe -x /usr/lib64/librbd.so rbd_aio_write
>     perf probe -x /usr/lib64/librbd.so rbd_aio_read  
> 
> This requires:
> 
> (1) The Kernel be compiled with:
> 
>     CONFIG_UPROBE=y
>         CONFIG_UPROBE_EVENTS=y

The RHEL 7.3 kernel has these config options enabled.  It is not necessary to compile a custom kernel.
Comment 17 Kashyap Chamarthy 2017-07-13 06:16:00 EDT
Thanks, Stefan.  Indeed, I mentioned those kernel config options only
for completeness' sake.

Test
----

(1) Add the probes:

    $ perf probe -x /usr/lib64/librbd.so.1 rbd_aio_read
    $ perf probe -x /usr/lib64/librbd.so.1 rbd_aio_write
    $ perf probe -x /usr/lib64/librbd.so.1 rbd_aio_release

(2) Add everything (read, write, and completion) onto a single
    command-line:

    $ perf record -a \
        -e probe_librbd:rbd_aio_read \
        -e probe_librbd:rbd_aio_write \
        -e probe_librbd:rbd_aio_release

(3) Then run a single O_DIRECT sequential I/O benchmark (e.g. reads)
    inside the guest:

    $ dd if=/dev/vda of=/dev/null iflag=direct count=50000

(4) Work out a script to calculate the request latencies from the
    perf-script(1) output.
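A minimal sketch of such a latency script for step (4), assuming the default perf-script(1) text output and an iodepth=1 workload (so each submission completes before the next one starts); the event-line layout and the sample timestamps below are illustrative, not taken from real data:

```python
import re

# Illustrative perf-script(1) text-output line (format assumed):
#   qemu-kvm 1234 [000] 1050.000100: probe_librbd:rbd_aio_read: (7f1a2b3c)
EVENT_RE = re.compile(r'\s(\d+\.\d+):\s+probe_librbd:(rbd_aio_\w+):')

def latencies(lines):
    """Pair each rbd_aio_read/write submission with the next
    rbd_aio_release completion; valid only for iodepth=1 workloads."""
    deltas = []
    pending = None  # timestamp of the single in-flight request
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue
        ts, event = float(m.group(1)), m.group(2)
        if event in ('rbd_aio_read', 'rbd_aio_write'):
            pending = ts
        elif event == 'rbd_aio_release' and pending is not None:
            deltas.append(ts - pending)
            pending = None
    return deltas

sample = [
    "qemu-kvm 1234 [000] 1050.000100: probe_librbd:rbd_aio_read: (7f1a2b3c)",
    "qemu-kvm 1234 [000] 1050.000113: probe_librbd:rbd_aio_release: (7f1a2b3c)",
    "qemu-kvm 1234 [000] 1050.000150: probe_librbd:rbd_aio_read: (7f1a2b3c)",
    "qemu-kvm 1234 [000] 1050.000162: probe_librbd:rbd_aio_release: (7f1a2b3c)",
]
d = latencies(sample)
print([round(x * 1e6) for x in d])  # per-request latency in microseconds -> [13, 12]
```

In practice you would pipe the output of `perf script -i perf.data` into a script along these lines.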

Perform the above test for all four combinations: (a) with and without
the RBD cache enabled; (b) pre- and post-migration.


Now, there are four tests to be done:

- Test-1: [Pre-migration] On source Compute node, do `perf-probe` test, 
          with RBD cache enabled

    Example result for this:

    http://file.emea.redhat.com/~kchamart/Nova-bug-1463897/output-all-three-probes-sequential-reads-pre-migration-RBD-cache-enabled.txt

- Test-2: [Pre-migration] On source Compute node, do `perf-probe` test, 
          with RBD cache disabled
- Test-3: [Post-migration] On destination Compute node, do `perf-probe`
          test with RBD cache enabled
- Test-4: [Post-migration] On destination Compute node, do `perf-probe`
          test with RBD cache disabled
Comment 19 Anil Dhingra 2017-07-19 03:39:14 EDT
I tried to replicate this in a physical lab (1 controller + 2 compute + 3 Ceph nodes, each Ceph node with 1 OS disk + 1 OSD disk), and the results were the same: after a shutdown / power-on of the VM the result was better, but without a shutdown it was almost 75% degraded. Let me also run the perf test in the lab.

If you need setup details please do let me know; I can keep this setup for 1-2 days.

Added a new device; the ioping test result is approx 22 MiB/s:
~~~~~~~~~~~~~~
[root@vm1 ~]# mkdir /test
[root@vm1 ~]# mount /dev/vdb /test
[root@vm1 ~]# 
[root@vm1 ~]# cd /test
[root@vm1 test]# ioping .
4 KiB <<< . (xfs /dev/vdb): request=1 time=124.0 us (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=235.9 us
4 KiB <<< . (xfs /dev/vdb): request=3 time=145.1 us
4 KiB <<< . (xfs /dev/vdb): request=4 time=120.3 us
4 KiB <<< . (xfs /dev/vdb): request=5 time=145.8 us
4 KiB <<< . (xfs /dev/vdb): request=6 time=177.8 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=137.9 us
4 KiB <<< . (xfs /dev/vdb): request=8 time=147.0 us
4 KiB <<< . (xfs /dev/vdb): request=9 time=133.5 us
4 KiB <<< . (xfs /dev/vdb): request=10 time=237.6 us (slow)
4 KiB <<< . (xfs /dev/vdb): request=11 time=216.2 us
4 KiB <<< . (xfs /dev/vdb): request=12 time=238.9 us (slow)
4 KiB <<< . (xfs /dev/vdb): request=13 time=105.6 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=14 time=143.7 us
4 KiB <<< . (xfs /dev/vdb): request=15 time=231.5 us
^C
--- . (xfs /dev/vdb) ioping statistics ---
14 requests completed in 2.42 ms, 56 KiB read, 5.79 k iops, 22.6 MiB/s
generated 15 requests in 14.5 s, 60 KiB, 1 iops, 4.14 KiB/s
min/avg/max/mdev = 105.6 us / 172.6 us / 238.9 us / 47.0 us


~~~~~~~~ below during live migration ~~~~~~

[root@vm1 test]# ioping .
4 KiB <<< . (xfs /dev/vdb): request=1 time=82.8 us (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=234.6 us
4 KiB <<< . (xfs /dev/vdb): request=3 time=439.3 us
4 KiB <<< . (xfs /dev/vdb): request=4 time=265.8 us
4 KiB <<< . (xfs /dev/vdb): request=5 time=259.4 us
4 KiB <<< . (xfs /dev/vdb): request=6 time=232.5 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=207.5 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=8 time=260.3 us
4 KiB <<< . (xfs /dev/vdb): request=9 time=270.5 us
4 KiB <<< . (xfs /dev/vdb): request=10 time=1.28 ms (slow)
4 KiB <<< . (xfs /dev/vdb): request=11 time=970.4 us
4 KiB <<< . (xfs /dev/vdb): request=12 time=859.2 us
4 KiB <<< . (xfs /dev/vdb): request=13 time=899.6 us
4 KiB <<< . (xfs /dev/vdb): request=14 time=880.2 us
4 KiB <<< . (xfs /dev/vdb): request=15 time=918.4 us
4 KiB <<< . (xfs /dev/vdb): request=16 time=812.6 us
4 KiB <<< . (xfs /dev/vdb): request=17 time=909.3 us
4 KiB <<< . (xfs /dev/vdb): request=18 time=788.1 us
4 KiB <<< . (xfs /dev/vdb): request=19 time=870.0 us
4 KiB <<< . (xfs /dev/vdb): request=20 time=801.0 us
4 KiB <<< . (xfs /dev/vdb): request=21 time=953.8 us
4 KiB <<< . (xfs /dev/vdb): request=22 time=826.5 us
4 KiB <<< . (xfs /dev/vdb): request=23 time=1.05 ms
4 KiB <<< . (xfs /dev/vdb): request=24 time=855.8 us
4 KiB <<< . (xfs /dev/vdb): request=25 time=870.9 us
4 KiB <<< . (xfs /dev/vdb): request=26 time=778.8 us
4 KiB <<< . (xfs /dev/vdb): request=27 time=1.07 ms
4 KiB <<< . (xfs /dev/vdb): request=28 time=831.7 us
4 KiB <<< . (xfs /dev/vdb): request=29 time=877.2 us
4 KiB <<< . (xfs /dev/vdb): request=30 time=939.2 us
4 KiB <<< . (xfs /dev/vdb): request=31 time=1.09 ms
4 KiB <<< . (xfs /dev/vdb): request=32 time=949.5 us
4 KiB <<< . (xfs /dev/vdb): request=33 time=992.6 us
4 KiB <<< . (xfs /dev/vdb): request=34 time=948.4 us
4 KiB <<< . (xfs /dev/vdb): request=35 time=956.6 us
4 KiB <<< . (xfs /dev/vdb): request=36 time=1.10 ms
4 KiB <<< . (xfs /dev/vdb): request=37 time=914.6 us
4 KiB <<< . (xfs /dev/vdb): request=38 time=925.5 us
4 KiB <<< . (xfs /dev/vdb): request=39 time=938.7 us
4 KiB <<< . (xfs /dev/vdb): request=40 time=961.5 us
4 KiB <<< . (xfs /dev/vdb): request=41 time=910.6 us
4 KiB <<< . (xfs /dev/vdb): request=42 time=918.7 us
4 KiB <<< . (xfs /dev/vdb): request=43 time=1.11 ms
4 KiB <<< . (xfs /dev/vdb): request=44 time=1.02 ms
4 KiB <<< . (xfs /dev/vdb): request=45 time=1.01 ms
4 KiB <<< . (xfs /dev/vdb): request=46 time=974.3 us
4 KiB <<< . (xfs /dev/vdb): request=47 time=1.04 ms
4 KiB <<< . (xfs /dev/vdb): request=48 time=1.11 ms
4 KiB <<< . (xfs /dev/vdb): request=49 time=940.2 us
4 KiB <<< . (xfs /dev/vdb): request=50 time=948.7 us
4 KiB <<< . (xfs /dev/vdb): request=51 time=1.04 ms
4 KiB <<< . (xfs /dev/vdb): request=52 time=974.6 us
4 KiB <<< . (xfs /dev/vdb): request=53 time=932.9 us
4 KiB <<< . (xfs /dev/vdb): request=54 time=825.3 us
4 KiB <<< . (xfs /dev/vdb): request=55 time=934.0 us
4 KiB <<< . (xfs /dev/vdb): request=56 time=1.09 ms
4 KiB <<< . (xfs /dev/vdb): request=57 time=1.03 ms
4 KiB <<< . (xfs /dev/vdb): request=58 time=935.7 us
4 KiB <<< . (xfs /dev/vdb): request=59 time=1.13 ms
4 KiB <<< . (xfs /dev/vdb): request=60 time=825.3 us
4 KiB <<< . (xfs /dev/vdb): request=61 time=965.7 us
4 KiB <<< . (xfs /dev/vdb): request=62 time=811.6 us
4 KiB <<< . (xfs /dev/vdb): request=63 time=891.3 us
4 KiB <<< . (xfs /dev/vdb): request=64 time=887.2 us
4 KiB <<< . (xfs /dev/vdb): request=65 time=833.5 us
4 KiB <<< . (xfs /dev/vdb): request=66 time=798.7 us
4 KiB <<< . (xfs /dev/vdb): request=67 time=1.10 ms
4 KiB <<< . (xfs /dev/vdb): request=68 time=932.7 us
4 KiB <<< . (xfs /dev/vdb): request=69 time=871.5 us
4 KiB <<< . (xfs /dev/vdb): request=70 time=704.2 us
4 KiB <<< . (xfs /dev/vdb): request=71 time=885.0 us
4 KiB <<< . (xfs /dev/vdb): request=72 time=867.3 us
4 KiB <<< . (xfs /dev/vdb): request=73 time=1.03 ms
4 KiB <<< . (xfs /dev/vdb): request=74 time=889 us
4 KiB <<< . (xfs /dev/vdb): request=75 time=993.1 us
4 KiB <<< . (xfs /dev/vdb): request=76 time=906.6 us
4 KiB <<< . (xfs /dev/vdb): request=77 time=1.04 ms
4 KiB <<< . (xfs /dev/vdb): request=78 time=765.8 us
4 KiB <<< . (xfs /dev/vdb): request=79 time=836.8 us
4 KiB <<< . (xfs /dev/vdb): request=80 time=783.8 us
4 KiB <<< . (xfs /dev/vdb): request=81 time=807.3 us
4 KiB <<< . (xfs /dev/vdb): request=82 time=744.9 us
4 KiB <<< . (xfs /dev/vdb): request=83 time=835.4 us
4 KiB <<< . (xfs /dev/vdb): request=84 time=722.7 us
4 KiB <<< . (xfs /dev/vdb): request=85 time=762.4 us
4 KiB <<< . (xfs /dev/vdb): request=86 time=774.9 us
4 KiB <<< . (xfs /dev/vdb): request=87 time=913.1 us
4 KiB <<< . (xfs /dev/vdb): request=88 time=881.3 us
4 KiB <<< . (xfs /dev/vdb): request=89 time=883.9 us
4 KiB <<< . (xfs /dev/vdb): request=90 time=780.9 us
4 KiB <<< . (xfs /dev/vdb): request=91 time=830.9 us
4 KiB <<< . (xfs /dev/vdb): request=92 time=897.6 us
4 KiB <<< . (xfs /dev/vdb): request=93 time=874.8 us
4 KiB <<< . (xfs /dev/vdb): request=94 time=814.1 us
^C
--- . (xfs /dev/vdb) ioping statistics ---
93 requests completed in 80.0 ms, 372 KiB read, 1.16 k iops, 4.54 MiB/s
generated 94 requests in 1.56 min, 376 KiB, 1 iops, 4.02 KiB/s
min/avg/max/mdev = 207.5 us / 860.5 us / 1.28 ms / 208.3 us
[root@vm1 test]# 

~~~~~~~~~~
[root@overcloud-controller-0 ~]# nova show VM1 | grep -i hypervi
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-compute-0.localdomain                                                  |
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# nova show VM1 | grep -i hypervi
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-compute-1.localdomain                                                  |
[root@overcloud-controller-0 ~]# 

~~~~~~~~~ 
~~~ Tested again: same result, still near 4 MiB/s ~~~

[root@vm1 test]# ioping .
4 KiB <<< . (xfs /dev/vdb): request=1 time=916.6 us (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=999.0 us
4 KiB <<< . (xfs /dev/vdb): request=3 time=957.0 us
4 KiB <<< . (xfs /dev/vdb): request=4 time=915.3 us
4 KiB <<< . (xfs /dev/vdb): request=5 time=893.7 us
4 KiB <<< . (xfs /dev/vdb): request=6 time=912.1 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=1.03 ms (slow)
4 KiB <<< . (xfs /dev/vdb): request=8 time=990.6 us
4 KiB <<< . (xfs /dev/vdb): request=9 time=1.03 ms (slow)
4 KiB <<< . (xfs /dev/vdb): request=10 time=887.2 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=11 time=949.0 us
4 KiB <<< . (xfs /dev/vdb): request=12 time=922.3 us
4 KiB <<< . (xfs /dev/vdb): request=13 time=1.12 ms (slow)
4 KiB <<< . (xfs /dev/vdb): request=14 time=917.6 us
4 KiB <<< . (xfs /dev/vdb): request=15 time=1.04 ms
4 KiB <<< . (xfs /dev/vdb): request=16 time=900.0 us
4 KiB <<< . (xfs /dev/vdb): request=17 time=987.8 us
4 KiB <<< . (xfs /dev/vdb): request=18 time=908.0 us
4 KiB <<< . (xfs /dev/vdb): request=19 time=1.03 ms
4 KiB <<< . (xfs /dev/vdb): request=20 time=921.8 us
^C
--- . (xfs /dev/vdb) ioping statistics ---
19 requests completed in 18.3 ms, 76 KiB read, 1.04 k iops, 4.05 MiB/s
generated 20 requests in 20.0 s, 80 KiB, 1 iops, 4.00 KiB/s
min/avg/max/mdev = 887.2 us / 964.2 us / 1.12 ms / 63.3 us
[root@vm1 test]# 

~~~ Even after a reboot, no improvement ~~~
[root@vm1 test]# reboot
Connection to 10.0.0.9 closed by remote host.
Connection to 10.0.0.9 closed.
[root@overcloud-controller-0 ~]# ip netns exec qdhcp-68a81ea2-e5aa-43ee-be7d-a0eabd60672b ssh -i mykey.pem cloud-user@10.0.0.9
Last login: Wed Jul 19 02:54:39 2017 from host-10-0-0-2.openstacklocal
[cloud-user@vm1 ~]$ sudo su -
Last login: Wed Jul 19 02:54:43 EDT 2017 on pts/0
[root@vm1 ~]# mount /dev/vdb /test                                     
[root@vm1 ~]# cd /test
[root@vm1 test]# ioping .            
4 KiB <<< . (xfs /dev/vdb): request=1 time=1.04 ms (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=942.4 us
4 KiB <<< . (xfs /dev/vdb): request=3 time=1.17 ms
4 KiB <<< . (xfs /dev/vdb): request=4 time=1.09 ms
4 KiB <<< . (xfs /dev/vdb): request=5 time=1.07 ms
4 KiB <<< . (xfs /dev/vdb): request=6 time=932.5 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=990.2 us
4 KiB <<< . (xfs /dev/vdb): request=8 time=1.00 ms
4 KiB <<< . (xfs /dev/vdb): request=9 time=1.06 ms
4 KiB <<< . (xfs /dev/vdb): request=10 time=1.10 ms
4 KiB <<< . (xfs /dev/vdb): request=11 time=931.1 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=12 time=1.12 ms
4 KiB <<< . (xfs /dev/vdb): request=13 time=1.00 ms
4 KiB <<< . (xfs /dev/vdb): request=14 time=931.1 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=15 time=1.01 ms
4 KiB <<< . (xfs /dev/vdb): request=16 time=1.17 ms (slow)
4 KiB <<< . (xfs /dev/vdb): request=17 time=1.14 ms
4 KiB <<< . (xfs /dev/vdb): request=18 time=968.0 us
^C
--- . (xfs /dev/vdb) ioping statistics ---
17 requests completed in 17.6 ms, 68 KiB read, 964 iops, 3.77 MiB/s
generated 18 requests in 17.1 s, 72 KiB, 1 iops, 4.20 KiB/s
min/avg/max/mdev = 931.1 us / 1.04 ms / 1.17 ms / 81.2 us

~~~~~~~~

~~~~~~~~~ Finally did a shutdown, and the test result was close to the original ~~~

[root@vm1 test]# shutdown -h now
Connection to 10.0.0.9 closed by remote host.
Connection to 10.0.0.9 closed.
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# 
[root@overcloud-controller-0 ~]# ip netns exec qdhcp-68a81ea2-e5aa-43ee-be7d-a0eabd60672b ssh -i mykey.pem cloud-user@10.0.0.9
Last login: Wed Jul 19 03:19:44 2017 from host-10-0-0-2.openstacklocal
[cloud-user@vm1 ~]$ 
[cloud-user@vm1 ~]$ sudo su -
Last login: Wed Jul 19 03:19:48 EDT 2017 on pts/0
[root@vm1 ~]# 
[root@vm1 ~]# mount /dev/vdb /test
[root@vm1 ~]# cd /test
[root@vm1 test]# ioping .
4 KiB <<< . (xfs /dev/vdb): request=1 time=148.6 us (warmup)
4 KiB <<< . (xfs /dev/vdb): request=2 time=264.8 us
4 KiB <<< . (xfs /dev/vdb): request=3 time=270.3 us
4 KiB <<< . (xfs /dev/vdb): request=4 time=226.8 us
4 KiB <<< . (xfs /dev/vdb): request=5 time=180.9 us
4 KiB <<< . (xfs /dev/vdb): request=6 time=252.5 us
4 KiB <<< . (xfs /dev/vdb): request=7 time=179.3 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=8 time=222.9 us
4 KiB <<< . (xfs /dev/vdb): request=9 time=296.7 us (slow)
4 KiB <<< . (xfs /dev/vdb): request=10 time=200.9 us
4 KiB <<< . (xfs /dev/vdb): request=11 time=179.4 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=12 time=202.0 us
4 KiB <<< . (xfs /dev/vdb): request=13 time=192.8 us
4 KiB <<< . (xfs /dev/vdb): request=14 time=199.6 us
4 KiB <<< . (xfs /dev/vdb): request=15 time=295.5 us (slow)
4 KiB <<< . (xfs /dev/vdb): request=16 time=198.0 us
4 KiB <<< . (xfs /dev/vdb): request=17 time=178.0 us (fast)
4 KiB <<< . (xfs /dev/vdb): request=18 time=185.5 us
^C
--- . (xfs /dev/vdb) ioping statistics ---
17 requests completed in 3.73 ms, 68 KiB read, 4.56 k iops, 17.8 MiB/s
generated 18 requests in 17.1 s, 72 KiB, 1 iops, 4.21 KiB/s
min/avg/max/mdev = 178.0 us / 219.2 us / 296.7 us / 40.1 us

************************
Comment 20 Kashyap Chamarthy 2017-07-19 05:47:30 EDT
(In reply to Anil Dhingra from comment #19)
> tried to replicate this is in physical lab 1Ctrl + 2 compute + 3 ceph node
> with 1 Os disk + 1 osd disk & results were same after shutdown / power on VM
> result was better but without shutdown 
> its was almost 75% degraded . Let me try do perf  test also in LAB
> 
> If you need setup details please do let me know - i can keep this setup for
> 1-2 days .

Feel free to do the tests in your lab, see if you can reproduce the problem, and upload the `perf` results somewhere.  (Just a gentle note: if you have super long outputs, attach them as plain text files.)

I have my own (virtual) setup where I am trying to replicate this problem with 3 Controllers, 2 Computes, and 3 Ceph nodes.

I am done with the first two `perf` tests, and have the data locally:

  - [Pre-migration] On source Compute, with RBD cache enabled
  - [Pre-migration] On source Compute, with RBD cache *disabled*

And I have yet to do the migration and the next two `perf` tests: 

  - [Post-migration] On destination Compute, with RBD cache enabled
  - [Post-migration] On destination Compute, with RBD cache *disabled*

Then, parse them through a script I wrote to calculate the latencies between 'rbd_aio_read' and 'rbd_aio_release' requests.

Then I will need a Ceph / QEMU I/O expert to help me interpret the data.

[...]
Comment 22 Anil Dhingra 2017-07-19 07:16 EDT
Created attachment 1300975 [details]
perf test
Comment 23 Anil Dhingra 2017-07-19 07:18 EDT
Created attachment 1300978 [details]
perf test post
Comment 24 Anil Dhingra 2017-07-19 07:20:37 EDT
Tests done with the RBD cache enabled; files attached.

=============

Pre
====
[root@vm1 test]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 1.92778 s, 13.3 MB/s


Post-migration took much longer
======
[root@vm1 test]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 29.2381 s, 876 kB/s


Source
======
[root@overcloud-compute-0 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 29 times to write data ]
[ perf record: Captured and wrote 8.101 MB perf.data (100000 samples) ]
[root@overcloud-compute-0 ~]# mv perf.data perf.data-with-cache-pre

Destination
==========
[root@overcloud-compute-1 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 8.190 MB perf.data (100012 samples) ]

[root@overcloud-compute-1 ~]# mv perf.data perf.data-with-cache-post

~~
will disable RBD cache now & do a fresh test
~~
Comment 25 Anil Dhingra 2017-07-19 09:27 EDT
Created attachment 1301093 [details]
perf test without cache pre
Comment 26 Anil Dhingra 2017-07-19 09:28 EDT
Created attachment 1301094 [details]
perf test without cache post
Comment 27 Anil Dhingra 2017-07-19 09:30:08 EDT
Below: tests with the RBD cache disabled

======
[heat-admin@overcloud-compute-1 ~]$ uptime
 12:00:36 up 4 min,  1 user,  load average: 0.03, 0.15, 0.08
[heat-admin@overcloud-compute-1 ~]$ tail -2 /etc/ceph/ceph.conf 
[client]
rbd_cache = false
[heat-admin@overcloud-compute-1 ~]$

Pre
~~~~

[root@overcloud-compute-1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 1     instance-00000004              running

[root@overcloud-compute-1 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 8.066 MB perf.data (100000 samples) ]

[root@overcloud-compute-1 ~]# mv perf.data perf.data-without-cache-pre

[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 1.96597 s, 13.0 MB/s

*********

[root@overcloud-compute-0 ~]# uptime
 12:00:24 up 4 min,  1 user,  load average: 0.02, 0.11, 0.05
[root@overcloud-compute-0 ~]# tail -2 /etc/ceph/ceph.conf 
[client]
rbd_cache = false

post
~~~~
[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=50000
50000+0 records in
50000+0 records out
25600000 bytes (26 MB) copied, 26.9762 s, 949 kB/s
[root@vm1 ~]#

[root@overcloud-compute-0 ~]# perf record -a -e probe_librbd:rbd_aio_read -e probe_librbd:rbd_aio_write -e probe_librbd:rbd_aio_release
^C[ perf record: Woken up 26 times to write data ]
[ perf record: Captured and wrote 8.127 MB perf.data (100000 samples) ]

[root@overcloud-compute-0 ~]# mv perf.data perf.data-without-cache-post
Comment 28 Stefan Hajnoczi 2017-07-19 14:12:07 EDT
I have calculated the following librbd read request latencies from Anil's perf data.

The dd mean latency is also included (calculated from Anil's dd output) for comparison - this shows that the perf results are sane.  The dd latency is librbd latency + QEMU + guest kernel latency.

Cache  Pre/Post  Median  Mean   dd
                 (us)    (us)   (us)
--------------------------------------
On     Pre       12      13     39
On     Post      494     501    585
Off    Pre       12      13     39
Off    Post      461     472    539
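As a sanity check, the dd column can be re-derived from Anil's dd timings in comment #24, where dd issued 50,000 sequential 512-byte O_DIRECT reads; a quick sketch (the helper name is illustrative):

```python
# Mean per-request latency implied by dd's elapsed time
# (50,000 O_DIRECT reads of 512 bytes each, from comment #24).
def dd_latency_us(elapsed_s, requests=50_000):
    return elapsed_s / requests * 1e6

print(round(dd_latency_us(1.92778)))   # pre-migration, cache on: 39 us
print(round(dd_latency_us(29.2381)))   # post-migration, cache on: 585 us
```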

Two observations:

1. librbd is much slower after migration.  This indicates that the issue is at the Ceph level, not due to QEMU or the guest.

2. Cache on/off makes no difference.  This is odd.  The block size is 512 bytes and only 26 MB are read.  Therefore cache on should perform well while cache off should be slower.
Comment 29 Stefan Hajnoczi 2017-07-19 14:14 EDT
Created attachment 1301289 [details]
Latency analysis script

Example usage:

$ perf script -i perf.data -s perf-latency.py
Think Time
Mean: 0.000025
Median: 0.000025
Min: 0.000023
Max: 0.000147
Histogram bucket size: 0.000112

0.000079: 49998
0.000191: 1
0.000304: 0
0.000416: 0
0.000529: 0
0.000641: 0
0.000754: 0
0.000866: 0
0.000979: 0
0.001091: 0

Read Latency
Mean: 0.000013
Median: 0.000012
Min: 0.000011
Max: 0.000521
Histogram bucket size: 0.000151

0.000087: 49997
0.000238: 1
0.000389: 1
0.000540: 1
0.000691: 0
0.000842: 0
0.000993: 0
0.001144: 0
0.001295: 0
0.001446: 0
Comment 31 Kashyap Chamarthy 2017-07-20 03:57:07 EDT
Hi Jason,

Given Stefan's observations about the `perf` data for the four tests (comment#28 and comment#29), it appears to be Ceph is the problem.  Do you, or other Ceph team members, have any advice here?
Comment 32 Shinobu KINJO 2017-07-20 05:32:47 EDT
(In reply to Kashyap Chamarthy from comment #31)
> Hi Jason,
> 
> Given Stefan's observations (about the `perf` data for the four tests (noted
> in earlier comments) in comment#28, and comment#29, it appears to be Ceph is
> the problem.  Do you, or other Ceph team members have any advise here?

At the moment we can not say "it appears to be Ceph is the problem" explicitly.
We need to run the same tests with another backend.
Comment 34 Kashyap Chamarthy 2017-07-20 07:49 EDT
Created attachment 1301672 [details]
Perf test results for the four test cases (pre- and post-migration, with and without RBD cache enabled)

The results are from running the raw `perf` data through Stefan's script perf-latency.py from comment#29.

So the attached text file contains the results of the four data sets, run through:

  [compute-0]$ perf script -i pre-mig-src-cache-yes.data -s perf-latency.py
  [compute-0]$ perf script -i pre-mig-src-cache-no.data -s perf-latency.py
  [compute-1]$ perf script -i post-mig-src-cache-no.data -s perf-latency.py
  [compute-1]$ perf script -i post-mig-src-cache-yes.data -s perf-latency.py
Comment 35 Stefan Hajnoczi 2017-07-20 08:02:09 EDT
(In reply to Shinobu KINJO from comment #32)
> (In reply to Kashyap Chamarthy from comment #31)
> > Hi Jason,
> > 
> > Given Stefan's observations (about the `perf` data for the four tests (noted
> > in earlier comments) in comment#28, and comment#29, it appears to be Ceph is
> > the problem.  Do you, or other Ceph team members have any advise here?
> 
> At the moment we can not say "it appears to be Ceph is the problem"
> explicitly.
> We need to do same tests with other backend.

The performance degradation observed by Anil *is* at the Ceph level.  The results in Comment #28 show that librbd's read latency increases after migration.

Once the guest and QEMU submit a request to librbd they are no longer involved until that request completes.  This is what Anil measured.

Therefore the next step is to focus on librbd and Ceph to see why read requests take longer after migration.  It's the next logical place to investigate because the results show that the latency degradation does not happen in the guest or QEMU.
Comment 36 Stefan Hajnoczi 2017-07-20 08:12:02 EDT
(In reply to Kashyap Chamarthy from comment #34)
> Created attachment 1301672 [details]
> Perf test results for the four test cases (pre- and post-migration, with and
> without RBD cache enabled
> 
> The result are after running the raw `perf` data through Stefan's script
> perf-latency.py from comment#29.
> 
> So the attached text file contains the result of the four data sets ran
> through:
> 
>   [compute-0]$ perf script -i pre-mig-src-cache-yes.data -s perf-latency.py
>   [compute-0]$ perf script -i pre-mig-src-cache-no.data -s perf-latency.py
>   [compute-1]$ perf script -i post-mig-src-cache-no.data -s perf-latency.py
>   [compute-1]$ perf script -i post-mig-src-cache-yes.data -s perf-latency.py

Here is a summary of what you've posted:

Cache  Pre/Post  Median  Mean   dd
                 (us)    (us)   (us)
--------------------------------------
On     Pre       695     725    ?
On     Post      775     802    ?
Off    Pre       667     712    ?
Off    Post      48      56     ?

Similar to Anil's results, there is no performance difference between cache on/off pre-migration.  This seems unusual for a sequential read test.

The post migration result with cache disabled is bizarre.  Performance was much better *after* migration?

Kashyap: Please post the disk I/O benchmark output from these runs so these numbers can be sanity checked.  It's clear from the number of I/O requests that you ran a different workload from Anil, but the details are important.

At this point I would say your results do not show a performance degradation after live migration in librbd.
Comment 41 Jason Dillaman 2017-07-20 10:46:20 EDT
After migration, the rbd cache for the Cinder volume is disabled:

*PRE-MIGRATION*

qemu      138779       1 99 14:42 ?        00:00:01 /usr/libexec/qemu-kvm -name guest=instance-00000004,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-12-instance-00000004/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 610b4ffd-5b36-467e-9f59-a718f2adc674 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.6-2.el7ost,serial=38873f97-1eae-4dc9-a74a-71b078bb59c9,uuid=610b4ffd-5b36-467e-9f59-a718f2adc674,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-12-instance-00000004/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=hHm8kz+svBrnCmXp4oeuXFWx15+mSpKtLMKkzoTij2Y=,keyid=masterKey0,iv=q2IMnNhkk673f3rQYNEIeg==,format=base64 -drive file=rbd:vms/610b4ffd-5b36-467e-9f59-a718f2adc674_disk:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -object secret,id=virtio-disk1-secret0,data=FYWo2AMkPmp5BGS6BMQn6LBvp6IIuamaaA5vCKx/sTU=,keyid=masterKey0,iv=YoJFXG+EDhGSCKwWVwxrJA==,format=base64 -drive 
file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=writeback -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:18:0d:aa,bus=pci.0,addr=0x3 -add-fd set=2,fd=31 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.74.160.8:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on


## -drive file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=writeback


** POST-MIGRATION **


qemu      135842       1  1 14:09 ?        00:00:30 /usr/libexec/qemu-kvm -name guest=instance-00000004,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-11-instance-00000004/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 8096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid 610b4ffd-5b36-467e-9f59-a718f2adc674 -smbios type=1,manufacturer=Red Hat,product=OpenStack Compute,version=14.0.6-2.el7ost,serial=38873f97-1eae-4dc9-a74a-71b078bb59c9,uuid=610b4ffd-5b36-467e-9f59-a718f2adc674,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-11-instance-00000004/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=5mU5udvfMsCiRB8giamW+f5ECusfzCXHn+jwE0c5ldI=,keyid=masterKey0,iv=vOlnzyPWhQk2Nm1URL0Saw==,format=base64 -drive file=rbd:vms/610b4ffd-5b36-467e-9f59-a718f2adc674_disk:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -object secret,id=virtio-disk1-secret0,data=mAx00CIXnxK7DoOaQfor6yah3m2HMCVkT8eIKIvmxIw=,keyid=masterKey0,iv=ICdwnOp9R0AwNzKL8BWf+Q==,format=base64 -drive 
file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=28,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:18:0d:aa,bus=pci.0,addr=0x3 -add-fd set=2,fd=32 -chardev file,id=charserial0,path=/dev/fdset/2,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 10.74.160.8:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on

## -drive file=rbd:volumes/volume-c04ed0aa-1cef-462c-847f-05cda2d2ed3c:id=openstack:auth_supported=cephx\;none:mon_host=10.74.160.36\:6789,file.password-secret=virtio-disk1-secret0,format=raw,if=none,id=drive-virtio-disk1,serial=c04ed0aa-1cef-462c-847f-05cda2d2ed3c,cache=none


It looks like libvirt/QEMU/Nova is dropping the Cinder volume's cache setting during migration (note that the Nova ephemeral volume retains its writeback cache setting after migration).
Comment 42 Jason Dillaman 2017-07-20 10:49:32 EDT
I also enabled Ceph admin sockets and logging and confirmed that cache was enabled pre-migration and disabled post-migration on the Cinder volume:

** PRE-MIGRATION **

# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.138779.140668137822720.asok config get rbd_cache
{
    "rbd_cache": "true"
}


** POST-MIGRATION **

# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.135842.139761430051328.asok config get rbd_cache
{
    "rbd_cache": "false"
}
Comment 43 Kashyap Chamarthy 2017-07-20 13:33:43 EDT
Nice catch, Jason!

So, from a chat with Daniel Berrangé on IRC, from a Nova perspective: "Prior to live migration, Nova rewrites all the <disk> elements and passes the updated guest XML across to the target libvirt. It never calls _set_disk_cache() when doing this, so `nova.conf`'s `writeback` setting gets lost, leaving the default `cache=none` setting. This mistake (falling back to the default cache value of 'none') is of course only corrected when the guest is later rebooted on the target."

So we need to set the cache mode in the _get_volume_config() method in Nova, because that is the callback used by _update_volume_xml() in nova/virt/libvirt/migration.py.

To that effect, here's a Nova upstream change: 

    https://review.openstack.org/#/c/485752/ -- libvirt/driver.py: Set 
    cache value for Cinder volume post-migration
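
The callback flow above can be sketched as follows. This is an illustrative, simplified model only (stub classes and function names here are hypothetical stand-ins, not the actual Nova source): if the callback that rebuilds each disk config during the migration XML rewrite does not re-apply the configured cache mode, the rewritten disk silently falls back to libvirt's default, cache=none.

```python
# Illustrative sketch of the bug and the fix; not actual Nova code.
CONFIGURED_CACHE_MODE = "writeback"  # e.g. disk_cachemodes in nova.conf


class GuestDisk:
    """Minimal stand-in for a libvirt guest disk config object."""
    def __init__(self):
        self.driver_cache = None  # unset -> libvirt defaults to cache=none


def get_volume_config_buggy(connection_info):
    # Pre-fix behaviour: the cache mode is never set during the
    # migration-time XML rewrite, so the default (none) wins.
    return GuestDisk()


def get_volume_config_fixed(connection_info):
    # Post-fix behaviour: re-apply the configured cache mode when
    # rebuilding the disk config (the real change does this from
    # _get_volume_config() in libvirt/driver.py).
    conf = GuestDisk()
    conf.driver_cache = CONFIGURED_CACHE_MODE
    return conf


def update_volume_xml(get_volume_config):
    # Stand-in for migration._update_volume_xml(): rebuilds the disk
    # config for the target host via the supplied callback.
    return get_volume_config(connection_info={})


if __name__ == "__main__":
    print(update_volume_xml(get_volume_config_buggy).driver_cache)
    print(update_volume_xml(get_volume_config_fixed).driver_cache)
```

With the buggy callback the rebuilt disk has no cache mode (so QEMU gets cache=none); with the fixed callback it keeps 'writeback', matching the pre-migration command line shown in comment 41.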
Comment 44 Shinobu KINJO 2017-07-20 20:05:46 EDT
(In reply to Jason Dillaman from comment #42)
> I also enabled Ceph admin sockets and logging and confirmed that cache was
> enabled pre-migration and disabled post-migration on the Cinder volume:
> 
> ** PRE-MIGRATION **
> 
> # ceph --admin-daemon
> /var/run/qemu/ceph-client.openstack.138779.140668137822720.asok config get
> rbd_cache
> {
>     "rbd_cache": "true"
> }
> 
> 
> ** POST-MIGRATION **
> 
> # ceph --admin-daemon
> /var/run/qemu/ceph-client.openstack.135842.139761430051328.asok config get
> rbd_cache
> {
>     "rbd_cache": "false"
> }

That's a good catch. However, the differing rbd_cache condition does not seem to account for the performance degradation.
The following is a test case with rbd_cache explicitly enabled on both the source and destination hosts.

// *** SOURCE HOST ***
// *** Condition of < rbd_cache > ***
[root@overcloud-compute-0 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.184066.140268954968064.asok config get rbd_cache
{
    "rbd_cache": "true"
}

// *** DO DD ***
[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=5000
5000+0 records in
5000+0 records out
2560000 bytes (2.6 MB) copied, 0.191401 s, 13.4 MB/s

// *** DO MIGRATION ***

// *** DESTINATION ***
// *** Condition of < rbd_cache > ***
[root@overcloud-compute-1 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.181268.139699450011648.asok config get rbd_cache
{
    "rbd_cache": "true"
}

// *** DO DD ***
[root@vm1 ~]# dd if=/dev/vdb of=/dev/null iflag=direct count=5000
5000+0 records in
5000+0 records out
2560000 bytes (2.6 MB) copied, 3.32248 s, 771 kB/s
Comment 45 Jason Dillaman 2017-07-20 20:33:20 EDT
@Shinobu: how did you "explicitly" enable the cache? The value in "ceph.conf" is overridden by the QEMU setting. Also, are you 100% positive you are using the correct ASOK file? There should be two files present (one for the Nova ephemeral volume and the other for Cinder). Also note that running "config set rbd_cache true" doesn't dynamically enable the cache but does change the configuration setting.
Comment 47 Shinobu KINJO 2017-07-20 20:48:36 EDT
(In reply to Jason Dillaman from comment #45)
> @Shinobu: how did you "explicitly" enable the cache? The value in
> "ceph.conf" is overridden by the QEMU setting. Also, are you 100% positive
> you are using the correct ASOK file? There should be two files present (one
> for the Nova ephemeral volume and the other for Cinder). Also note that
> running "config set rbd_cache true" doesn't dynamically enable the cache but
> does change the configuration setting.

If I were using the wrong socket file, I shouldn't be able to get the config value at all, should I? Is there something wrong with what I saw?

[root@overcloud-compute-0 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.184066.140268954968064.asok config get rbd_cache
{
    "rbd_cache": "true"
}

[root@overcloud-compute-1 ~]# ceph --admin-daemon /var/run/qemu/ceph-client.openstack.181268.139699450011648.asok config get rbd_cache
{
    "rbd_cache": "true"
}

// ******

[root@overcloud-compute-0 ~]# cat /etc/ceph/ceph.conf | grep cache
rbd_cache = true

[root@overcloud-compute-1 ~]# cat /etc/ceph/ceph.conf | grep cache
rbd_cache = true
Comment 49 Jason Dillaman 2017-07-20 21:07:57 EDT
@Shinobu: you didn't really answer my question. There are two ASOK files per instance of qemu since it has two disks attached. You checked two ASOK files on two different hosts. Also, as I mentioned, the QEMU cache setting overrides what you have in your ceph.conf, so grepping that file doesn't imply anything. 

Regardless, I also looked at the raw logs and saw that the actual IO latency between librbd and the OSD was consistent between pre- and post- migration. That is why I looked into the cache settings and confirmed that the Cinder cache was being disabled upon migration. I recommend a re-test with the proposed fix applied to Nova.
Comment 51 Shinobu KINJO 2017-07-20 21:26:22 EDT
(In reply to Jason Dillaman from comment #49)
> @Shinobu: you didn't really answer my question. There are two ASOK files per
> instance of qemu since it has two disks attached. You checked two ASOK files
> on two different hosts. Also, as I mentioned, the QEMU cache setting
> overrides what you have in your ceph.conf, so grepping that file doesn't
> imply anything. 
> 
> Regardless, I also looked at the raw logs and saw that the actual IO latency
> between librbd and the OSD was consistent between pre- and post- migration.
> That is why I looked into the cache settings and confirmed that the Cinder
> cache was being disabled upon migration. I recommend a re-test with the
> proposed fix applied to Nova.

That makes sense.
Comment 55 Kashyap Chamarthy 2017-07-29 11:21:04 EDT
The patch for Git master is merged.

And here's the upstream stable/newton backport, in-progress:

    https://review.openstack.org/#/c/488959/
Comment 57 Kashyap Chamarthy 2017-09-18 09:37:30 EDT
Verification notes for this bug:

*Without* this bug fix (from openstack-nova-14.0.8-2.el7ost), when you 
migrate a Nova instance with a Cinder volume -- where both Nova
instance's disk and the Cinder volume are on Ceph -- the cache value for 
the Cinder volume (erroneously) changes from 'writeback' to 'none':

    [Check by doing `ps -ef | grep qemu`, and look for the relevant QEMU
    process associated with the Nova instance.]

    Pre-migration, QEMU command-line for the Nova instance:

        [...] -drive file=rbd:volumes/volume-[...],cache=writeback

    Post-migration, QEMU command-line for the Nova instance:

        [...] -drive file=rbd:volumes/volume-[...],cache=none

*With* the bug fix (from openstack-nova-14.0.8-2.el7ost), the cache 
value for the Cinder volume should remain 'writeback':

    Pre-migration, QEMU command-line for the Nova instance:

        [...] -drive file=rbd:volumes/volume-[...],cache=writeback

    Post-migration, QEMU command-line for the Nova instance:

        [...] -drive file=rbd:volumes/volume-[...],cache=writeback
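
The `ps` check above can also be done programmatically. Below is a small helper (hypothetical, for illustration only) that maps each `-drive` on a QEMU command line to its cache= setting; note it matches `,id=` and `,cache=` with a leading comma so the `id=openstack` inside the rbd file spec is not picked up by mistake.

```python
import re


def drive_cache_modes(qemu_cmdline):
    """Map each -drive's id= to its cache= setting on a QEMU command line."""
    modes = {}
    for m in re.finditer(r"-drive (\S+)", qemu_cmdline):
        opts = m.group(1)
        # Leading comma distinguishes the drive option ",id=drive-..." from
        # the ":id=openstack" embedded in the rbd file specification.
        drive_id = re.search(r",id=([^,]+)", opts)
        cache = re.search(r",cache=([^,]+)", opts)
        if drive_id:
            modes[drive_id.group(1)] = cache.group(1) if cache else None
    return modes


# Sample mirroring the post-migration state from comment 41 (abbreviated):
cmdline = (
    "-drive file=rbd:vms/disk,format=raw,if=none,"
    "id=drive-virtio-disk0,cache=writeback,discard=unmap "
    "-drive file=rbd:volumes/volume-abc,format=raw,if=none,"
    "id=drive-virtio-disk1,cache=none"
)
print(drive_cache_modes(cmdline))
# {'drive-virtio-disk0': 'writeback', 'drive-virtio-disk1': 'none'}
```

Run against the output of `ps -ef | grep qemu` for the instance, a Cinder volume showing 'none' post-migration indicates the unfixed behaviour; with the fix it should show 'writeback' on both sides.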
Comment 62 errata-xmlrpc 2017-09-28 12:32:44 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2823
