Description of problem:
Nova instance performance issues while using the Ceph backend.

Version-Release number of selected component (if applicable):
RHEL OSP 8

How reproducible:
Every time, per the customer.

Steps to Reproduce:
1. Spawn an instance using a qcow2 image, or using a volume created from a qcow2 image.
2. Spawn an instance using a raw image, or using a volume created from a raw image. This creates a parent/child relation in both cases.
3. The instance spawned in Step 1 gives roughly twice the performance of the Step 2 instance when running a dd test from inside the instance.

~~~
qcow2 image :

[root@dawcow2image ~]# dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 2.18062 s, 492 MB/s

raw image :

[root@darawimage ~]# dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 5.21842 s, 206 MB/s
[root@darawimage ~]#
~~~

Once we flatten the Step 2 instance's disk on the Ceph backend, both instances give equal performance.

Actual results:
The performance difference is very large.

Expected results:
The performance difference should not be so large. Also, what has changed in the driver code that creates the P/C relation for Ceph volumes from a raw image on OSP 8 but not on OSP 9?

Additional info:

+++++++++++
OSP 7 Setup
+++++++++++

No P/C relation exists when we create a cinder volume from a raw image.

Step 1 : Checking the image status in glance.

Raw image :
~~~
# rbd -p images ls -l | grep bd0f6352-ca64-4476-94e0-a132d56b4399
bd0f6352-ca64-4476-94e0-a132d56b4399        10240M   2
bd0f6352-ca64-4476-94e0-a132d56b4399@snap   10240M   2 yes

# qemu-img info rbd:images/bd0f6352-ca64-4476-94e0-a132d56b4399
image: rbd:images/bd0f6352-ca64-4476-94e0-a132d56b4399
file format: raw
virtual size: 10G (10737418240 bytes)
disk size: unavailable
cluster_size: 8388608
Snapshot list:
ID    TAG    VM SIZE    DATE                  VM CLOCK
snap  snap   10G        1970-01-01 05:30:00   00:00:00.000
~~~

qcow2 image :
~~~
# rbd -p images ls -l | grep 7c764b1f-4726-4a92-834d-e117be22d0bf
7c764b1f-4726-4a92-834d-e117be22d0bf        452M   2
7c764b1f-4726-4a92-834d-e117be22d0bf@snap   452M   2 yes

# qemu-img info rbd:images/7c764b1f-4726-4a92-834d-e117be22d0bf
image: rbd:images/7c764b1f-4726-4a92-834d-e117be22d0bf
file format: qcow2
virtual size: 10G (10737418240 bytes)
disk size: unavailable
cluster_size: 65536
Format specific information:
    compat: 0.10
    refcount bits: 16
~~~

Step 2 : Created volumes using the images.

Raw volume : same size as the original raw image.
~~~
# rbd -p volumes ls -l | grep 884273be-4361-4f62-b211-b620a64e0a76
volume-884273be-4361-4f62-b211-b620a64e0a76   10240M   2
~~~

qcow2 volume : larger than the qcow2 image.
~~~
# rbd -p volumes ls -l | grep 9c7034c3-0f5c-427e-a414-0798ccdf2694
volume-9c7034c3-0f5c-427e-a414-0798ccdf2694   40960M   2
~~~

Step 3 : Spawned instances using volumes.
# nova list +--------------------------------------+-----------------+--------+------------+-------------+-------------------------+ | ID | Name | Status | Task State | Power State | Networks | +--------------------------------------+-----------------+--------+------------+-------------+-------------------------+ | 0cd131d5-8295-4c04-b159-671a6dda1c33 | qcow2-instance1 | ACTIVE | - | Running | internal=192.168.122.73 | | ee5bcf8e-d921-4948-84a9-0d6131ae07af | raw-instance1 | ACTIVE | - | Running | internal=192.168.122.72 | +--------------------------------------+-----------------+--------+------------+-------------+-------------------------+ Step 4 : Checking the children of the raw image nothing is getting displayed. qcow2 image : # rbd -p images children 7c764b1f-4726-4a92-834d-e117be22d0bf@snap raw image : # rbd -p images children bd0f6352-ca64-4476-94e0-a132d56b4399@snap +++++++++++ OSP 8 Setup +++++++++++ P/C relation exist when we are creating a cinder volume from a raw image. Step 1 : Created two glance images qcow2 and raw. ~~~ $ rbd -p images ls -l NAME SIZE PARENT FMT PROT LOCK 212af2e3-14fc-4c93-a8a1-75fa0449abb8 10240M 2 212af2e3-14fc-4c93-a8a1-75fa0449abb8@snap 10240M 2 yes e954573b-5e75-4612-a80b-290134e07905 472M 2 e954573b-5e75-4612-a80b-290134e07905@snap 472M 2 yes ~~~ Step 2 : No children is present by default. ~~ $ rbd -p images children 212af2e3-14fc-4c93-a8a1-75fa0449abb8@snap $ rbd -p images children e954573b-5e75-4612-a80b-290134e07905@snap ~~~ Step 3 : Created ceph volumes using both images. ~~~ $ cinder create --image-id e954573b-5e75-4612-a80b-290134e07905 --display-name qcow2-volume1 20 +---------------------------------------+--------------------------------------+ | Property | Value | +---------------------------------------+--------------------------------------+ | attachments | [] | | availability_zone | nova | | bootable | false | | consistencygroup_id | None | | created_at | 2016-10-05T04:08:16.000000 | | description | None | | encrypted | False | | id | 1247c6d1-f7e2-49ef-b055-91002d1f817e | | metadata | {} | | migration_status | None | | multiattach | False | | name | qcow2-volume1 | | os-vol-host-attr:host | hostgroup@tripleo_ceph#tripleo_ceph | | os-vol-mig-status-attr:migstat | None | | os-vol-mig-status-attr:name_id | None | | os-vol-tenant-attr:tenant_id | 743020b644804528b898ae3fd6a5a558 | | os-volume-replication:driver_data | None | | os-volume-replication:extended_status | None | | replication_status | disabled | | size | 20 | | snapshot_id | None | | source_volid | None | | status | creating | | user_id | b18953ac5c7f40b1953beb64f086c55b | | volume_type | None | +---------------------------------------+--------------------------------------+ $ cinder create --image-id 212af2e3-14fc-4c93-a8a1-75fa0449abb8 --display-name raw-volume1 20 +---------------------------------------+--------------------------------------+ | Property | Value | +---------------------------------------+--------------------------------------+ | attachments | [] | | availability_zone | nova | | bootable | false | | consistencygroup_id | None | | created_at | 2016-10-05T04:16:42.000000 | | description | None | | encrypted | False | | id | 7ef90ad8-1ba9-45fc-b3ea-d9ac27eeb18b | | metadata | {} | | migration_status | None | | multiattach | False | | name | raw-volume1 | | os-vol-host-attr:host | hostgroup@tripleo_ceph#tripleo_ceph | | os-vol-mig-status-attr:migstat | None | | os-vol-mig-status-attr:name_id | None | | os-vol-tenant-attr:tenant_id | 743020b644804528b898ae3fd6a5a558 
| | os-volume-replication:driver_data | None | | os-volume-replication:extended_status | None | | replication_status | disabled | | size | 20 | | snapshot_id | None | | source_volid | None | | status | creating | | user_id | b18953ac5c7f40b1953beb64f086c55b | | volume_type | None | +---------------------------------------+--------------------------------------+ ~~~ Step 4 : Parent/child relation is showing for raw image. ~~~ # rbd -p images children e954573b-5e75-4612-a80b-290134e07905@snap # rbd -p images children 212af2e3-14fc-4c93-a8a1-75fa0449abb8@snap volumes/volume-7ef90ad8-1ba9-45fc-b3ea-d9ac27eeb18b ~~~ ========================================== Tests created with rbd is not showing such huge difference however order of magnitude difference is seen in number of operations per second between compute and controller node. +++++++++++++++++ From compute node +++++++++++++++++ CASE 1 [root@overcloud-compute-0 ~]# rbd create volumes/data-disk1 -s 204800 --image-format 2 [root@overcloud-compute-0 ~]# rbd bench-write volumes/data-disk1 --io-size 4096 --io-threads 16 --io-total 10000000000 --io-pattern seq bench-write io_size 4096 io_threads 16 bytes 10000000000 pattern seq SEC OPS OPS/SEC BYTES/SEC 1 133616 133642.59 547400032.10 2 273940 136983.43 561084135.83 3 413982 138003.01 565260308.65 4 554961 138746.85 568307080.88 5 699292 139863.75 572881913.64 6 839915 141229.15 578474586.94 7 983485 141909.04 581259434.77 8 1124749 142153.32 582259991.72 9 1267997 142607.29 584119447.63 10 1410098 142161.22 582292350.78 11 1550587 142165.28 582308990.86 12 1693153 141933.57 581359897.00 13 1834038 141857.92 581050028.05 14 1974950 141390.69 579136247.40 15 2118564 141693.16 580375181.91 16 2263242 142530.95 583806782.64 17 2398967 141162.67 578202276.85 elapsed: 17 ops: 2441407 ops/sec: 140094.59 bytes/sec: 573827429.96 CASE 2 [root@overcloud-compute-0 ~]# rbd create volumes/data-disk2 -s 204800 --image-format 2 [root@overcloud-compute-0 ~]# rbd snap create volumes/data-disk2@snap [root@overcloud-compute-0 ~]# rbd snap protect volumes/data-disk2@snap [root@overcloud-compute-0 ~]# rbd clone volumes/data-disk2@snap volumes/data-disk3 [root@overcloud-compute-0 ~]# rbd -p volumes ls -l 2016-10-06 07:48:30.235024 7f6310b0f7c0 -1 librbd::ImageCtx: error reading immutable metadata: (2) No such file or directory 2016-10-06 07:48:30.592282 7f6310b0f7c0 -1 librbd::ImageCtx: error reading immutable metadata: (2) No such file or directory NAME SIZE PARENT FMT PROT LOCK data-disk1 200G 2 data-disk2 200G 2 data-disk2@snap 200G 2 yes data-disk3 200G volumes/data-disk2@snap 2 volume-00c9b6d3-df75-4df3-a6ee-c20d7c5846ff 10240M 2 volume-1177c82e-7e39-439a-a71f-f9b723c7cb52 10240M 2 volume-183f6b49-16b9-4eaf-a5ac-63112a33be24 102400M 2 volume-1b77b178-2540-436b-a14d-083e3a5a87c2 20480M 2 volume-1d417105-10e4-44bc-91ce-7f8a712e49ca 10240M 2 volume-26b256cb-ed41-4569-ae3f-d5e800070a1c 26624M volumes/volume-c45d9aac-76f1-4417-b208-853f3ce078e8 2 volume-26d4c4b3-1ee4-46fe-bf3a-f63c59553d8b 10240M 2 volume-2f959a39-7010-4da8-8106-b5716e04c648 10240M 2 volume-36d68b80-d0e4-4a48-b919-b43815a04054 10240M 2 volume-3a9573d7-1df5-4cbe-914c-48275a6e530c 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-44212b31-06e6-4f2c-ad35-e8156fe46dae 10240M 2 volume-4d96e569-6711-4ac0-85e1-3a4c13258a4e 10240M 2 volume-4ff2d560-74bd-4df1-bc3e-aee7bd6ecc2b 20480M 2 volume-52eb29c3-0bc0-4e2b-885d-27beabf95733 10240M 2 volume-55b01632-cc3d-4200-93fd-a2b326f39711 10240M 2 volume-56fb6c23-c8f2-475e-9960-8cc187b1cedd 10240M 2 
volume-615b4bfa-51ce-4529-ae65-4361e67d0252 10240M 2 volume-644cfdda-9d94-4784-ba76-7487e1f4ae6b 20480M 2 volume-661a8823-fbf2-40f2-b8ac-1789e3c8bb3f 10240M 2 volume-661b765e-68ec-4b9b-9482-6289c8f0d470 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-686931f1-0c97-4a28-a23a-e21470c40dbe 10240M 2 volume-6931d441-6dbe-4b51-81d4-208be35d27e0 10240M 2 volume-6d5b6019-5055-4683-af12-a47dd88d2c51 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-6d887507-a804-41e3-a633-4a5b0f651a90 10240M 2 volume-6e98a776-0679-417c-88b2-590394fa5bdc 10240M 2 volume-6ecc3b25-0cfb-46fb-8a10-2648ce1cc849 20480M 2 volume-72990018-260f-4880-939a-1b0566c2a262 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-88c88720-8963-4ecb-8dc3-6557681b98f1 20480M 2 volume-9b3c1ec4-7832-485e-94df-020b4b1fcd70 20480M 2 volume-9de05763-535a-4e28-8013-a28c3bd5dce1 10240M 2 volume-9debd6ff-760c-4c85-bcfa-7beec0614e48 10240M 2 volume-9ffa1c5f-f679-4261-b8d5-22371218f095 20480M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 volume-a87cc971-a70e-49a9-bfcb-e519652b6e9a 20480M 2 volume-adcc668d-8ae7-4e79-bf86-4ea5b9054fe2 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-b14f9c4f-4d05-4509-8482-adf63d4702a5 10240M 2 volume-b60e26e1-a7f9-425a-8eca-5b8ee0fe76df 40960M 2 volume-b765d604-81b6-4386-88d7-df9fa1ff76c5 20480M 2 volume-b7ba67ef-c247-456d-9558-db04fd91799d 102400M 2 volume-c30438fb-238b-4768-8080-85479576fd41 30720M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 volume-c45d9aac-76f1-4417-b208-853f3ce078e8 26624M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 volume-c45d9aac-76f1-4417-b208-853f3ce078e8 26624M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 yes volume-c9e89560-41bb-426b-9b8e-a06e9bcbb0e4 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-ce4ddc0b-31ce-4e6f-a481-73bd339befa7 10240M 2 volume-d125897c-b3d4-4547-825b-69b3ee08a2b6 10240M 2 volume-d40c5dd1-3ed8-4d3c-b628-1adc062a4d7b 10240M 2 volume-da3b0d29-3706-48e5-8628-044856794d78 102400M 2 volume-db03cd9d-d5d6-4b09-8c2e-a9d0be7bbb47 10240M 2 volume-e08c721e-059f-41c6-8f55-cb79797688ea 40960M 2 volume-e08c721e-059f-41c6-8f55-cb79797688ea@snapshot-bfecf3c0-0835-4fc1-a710-685e1f944acc 40960M 2 yes volume-e49a946d-15e2-4c63-9573-7a7fb3f2938b 10240M 2 volume-e62c6254-e54b-4a40-a979-96a25862b5b2 10240M 2 volume-e77208c6-50bc-4174-bb21-4089067c6c29 10240M 2 volume-edb013b3-7682-4129-97f7-23d5f483783f 20480M 2 volume-f3a7432b-c6c3-4966-97e5-a6866950c24b 20480M 2 volume-f4f66f86-c41b-4914-abf4-a77d0bbefa49 10240M 2 volume-fa06ffb5-600f-4a7b-9019-65d9ec1729d4 10240M 2 [root@overcloud-compute-0 ~]# [root@overcloud-compute-0 ~]# rbd bench-write volumes/data-disk3 --io-size 4096 --io-threads 16 --io-total 10000000000 --io-pattern seq bench-write io_size 4096 io_threads 16 bytes 10000000000 pattern seq SEC OPS OPS/SEC BYTES/SEC 1 145978 145999.49 598013923.68 2 288173 144092.62 590203353.91 3 420567 140134.09 573989252.78 4 555840 138965.49 569202660.08 5 699641 139920.23 573113262.20 6 826637 136076.33 557368644.26 7 967951 135928.91 556764816.49 8 1045903 125062.80 512257217.08 9 1188959 126570.31 518431998.55 10 1318842 123812.25 507134975.81 11 1431096 120918.36 495281616.36 12 1537583 113870.90 466415215.19 13 1648178 120491.21 493531984.39 14 1784090 119054.84 487648620.73 15 1914959 119260.89 488492608.94 16 2052732 124350.70 509340450.06 17 2193732 131321.36 537892292.11 18 2337242 137812.97 564481936.55 elapsed: 18 ops: 2441407 ops/sec: 130009.18 bytes/sec: 532517618.61 
++++++++++++++++++++ From controller node ++++++++++++++++++++ CASE 1 [root@overcloud-controller-0 ~]# rbd create volumes/data-disk1 -s 204800 --image-format 2 [root@overcloud-controller-0 ~]# rbd bench-write volumes/data-disk1 --io-size 4096 --io-threads 16 --io-total 10000000000 --io-pattern seq bench-write io_size 4096 io_threads 16 bytes 10000000000 pattern seq SEC OPS OPS/SEC BYTES/SEC 1 96516 96538.25 395420656.90 2 174398 87208.69 357206774.84 3 262656 87559.26 358642739.55 4 344730 86182.22 353002358.55 5 437353 87474.84 358296955.56 6 520981 84892.94 347721462.55 7 605124 86145.58 352852312.00 8 686550 84778.77 347253830.83 9 764774 84013.07 344117524.11 10 855034 83536.25 342164460.11 11 934666 82736.99 338890727.33 12 1017128 82400.82 337513775.69 13 1107463 84182.51 344811566.17 14 1180503 83145.92 340565679.11 15 1262974 81587.36 334181825.41 16 1332059 79478.62 325544410.06 17 1437295 84033.42 344200907.94 18 1565676 91642.72 375368578.86 19 1683208 100540.86 411815379.39 20 1802435 107892.99 441929675.70 21 1920824 117752.98 482316214.98 22 2009194 114379.85 468499869.47 23 2104777 107820.23 441631655.53 24 2191861 101730.73 416689082.30 25 2292499 98012.88 401460768.52 26 2386867 93208.64 381782600.61 elapsed: 28 ops: 2441407 ops/sec: 86784.91 bytes/sec: 355470991.53 [root@overcloud-controller-0 ~]# CASE 2 [root@overcloud-controller-0 ~]# rbd create volumes/data-disk2 -s 204800 --image-format 2 [root@overcloud-controller-0 ~]# rbd snap create volumes/data-disk2@snap [root@overcloud-controller-0 ~]# rbd snap protect volumes/data-disk2@snap [root@overcloud-controller-0 ~]# rbd clone volumes/data-disk2@snap volumes/data-disk3 [root@overcloud-controller-0 ~]# rbd -p volumes ls -l 2016-10-05 13:10:38.858010 7f8080a067c0 -1 librbd::ImageCtx: error reading immutable metadata: (2) No such file or directory 2016-10-05 13:10:39.007380 7f8080a067c0 -1 librbd::ImageCtx: error reading immutable metadata: (2) No such file or directory NAME SIZE PARENT FMT PROT LOCK data-disk1 200G 2 data-disk2 200G 2 data-disk2@snap 200G 2 yes data-disk3 200G volumes/data-disk2@snap 2 volume-00c9b6d3-df75-4df3-a6ee-c20d7c5846ff 10240M 2 volume-1177c82e-7e39-439a-a71f-f9b723c7cb52 10240M 2 volume-1d417105-10e4-44bc-91ce-7f8a712e49ca 10240M 2 volume-26b256cb-ed41-4569-ae3f-d5e800070a1c 26624M volumes/volume-c45d9aac-76f1-4417-b208-853f3ce078e8 2 volume-26d4c4b3-1ee4-46fe-bf3a-f63c59553d8b 10240M 2 volume-2f959a39-7010-4da8-8106-b5716e04c648 10240M 2 volume-36d68b80-d0e4-4a48-b919-b43815a04054 10240M 2 volume-3a9573d7-1df5-4cbe-914c-48275a6e530c 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-44212b31-06e6-4f2c-ad35-e8156fe46dae 10240M 2 volume-4d96e569-6711-4ac0-85e1-3a4c13258a4e 10240M 2 volume-52eb29c3-0bc0-4e2b-885d-27beabf95733 10240M 2 volume-55b01632-cc3d-4200-93fd-a2b326f39711 10240M 2 volume-56fb6c23-c8f2-475e-9960-8cc187b1cedd 10240M 2 volume-615b4bfa-51ce-4529-ae65-4361e67d0252 10240M 2 volume-661a8823-fbf2-40f2-b8ac-1789e3c8bb3f 10240M 2 volume-661b765e-68ec-4b9b-9482-6289c8f0d470 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-686931f1-0c97-4a28-a23a-e21470c40dbe 10240M 2 volume-6931d441-6dbe-4b51-81d4-208be35d27e0 10240M 2 volume-6d5b6019-5055-4683-af12-a47dd88d2c51 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-6d887507-a804-41e3-a633-4a5b0f651a90 10240M 2 volume-6e98a776-0679-417c-88b2-590394fa5bdc 10240M 2 volume-6ecc3b25-0cfb-46fb-8a10-2648ce1cc849 20480M 2 volume-72990018-260f-4880-939a-1b0566c2a262 20480M 
images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-88c88720-8963-4ecb-8dc3-6557681b98f1 20480M 2 volume-9b3c1ec4-7832-485e-94df-020b4b1fcd70 20480M 2 volume-9de05763-535a-4e28-8013-a28c3bd5dce1 10240M 2 volume-9debd6ff-760c-4c85-bcfa-7beec0614e48 10240M 2 volume-9ffa1c5f-f679-4261-b8d5-22371218f095 20480M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 volume-adcc668d-8ae7-4e79-bf86-4ea5b9054fe2 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-b14f9c4f-4d05-4509-8482-adf63d4702a5 10240M 2 volume-b60e26e1-a7f9-425a-8eca-5b8ee0fe76df 40960M 2 volume-c30438fb-238b-4768-8080-85479576fd41 30720M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 volume-c45d9aac-76f1-4417-b208-853f3ce078e8 26624M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 volume-c45d9aac-76f1-4417-b208-853f3ce078e8 26624M images/acd4bd27-dff9-4d75-9c20-fed7d31adb28@snap 2 yes volume-c9e89560-41bb-426b-9b8e-a06e9bcbb0e4 20480M images/37dc9447-11c4-4b58-99f8-e50305a02b06@snap 2 volume-ce4ddc0b-31ce-4e6f-a481-73bd339befa7 10240M 2 volume-d125897c-b3d4-4547-825b-69b3ee08a2b6 10240M 2 volume-d40c5dd1-3ed8-4d3c-b628-1adc062a4d7b 10240M 2 volume-da3b0d29-3706-48e5-8628-044856794d78 102400M 2 volume-db03cd9d-d5d6-4b09-8c2e-a9d0be7bbb47 10240M 2 volume-e08c721e-059f-41c6-8f55-cb79797688ea 40960M 2 volume-e08c721e-059f-41c6-8f55-cb79797688ea@snapshot-bfecf3c0-0835-4fc1-a710-685e1f944acc 40960M 2 yes volume-e49a946d-15e2-4c63-9573-7a7fb3f2938b 10240M 2 volume-e62c6254-e54b-4a40-a979-96a25862b5b2 10240M 2 volume-e77208c6-50bc-4174-bb21-4089067c6c29 10240M 2 volume-f4f66f86-c41b-4914-abf4-a77d0bbefa49 10240M 2 volume-fa06ffb5-600f-4a7b-9019-65d9ec1729d4 10240M 2 [root@overcloud-controller-0 ~]# [root@overcloud-controller-0 ~]# rbd info volumes/data-disk3 rbd image 'data-disk3': size 200 GB in 51200 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.4d73c3d1b58ba format: 2 features: layering flags: parent: volumes/data-disk2@snap overlap: 200 GB [root@overcloud-controller-0 ~]# [root@overcloud-controller-0 ~]# rbd bench-write volumes/data-disk3 --io-size 4096 --io-threads 16 --io-total 10000000000 --io-pattern seq bench-write io_size 4096 io_threads 16 bytes 10000000000 pattern seq SEC OPS OPS/SEC BYTES/SEC 1 99560 99580.71 407882596.19 2 200988 100413.26 411292721.31 3 255062 84975.27 348058726.17 4 326363 81595.70 334216002.90 5 400676 80094.24 328066014.35 6 465726 73232.75 299961331.04 7 539850 67796.92 277696179.89 8 609898 70993.29 290788509.75 9 687556 72238.57 295889165.59 10 743935 68690.23 281355182.58 11 807579 68370.98 280047550.12 12 868570 65744.03 269287546.04 13 927884 63588.56 260458745.64 14 977269 57938.93 237317852.52 15 1050866 61386.27 251438151.73 16 1135875 65640.19 268862226.02 17 1207778 67841.66 277879432.47 18 1280540 70540.73 288934842.32 19 1340824 72715.62 297843184.23 20 1411871 72200.93 295734990.68 21 1501737 73193.58 299800898.93 22 1568769 72198.20 295723812.41 23 1631478 70187.53 287488134.47 24 1684841 68803.34 281818473.38 25 1766260 70877.86 290315716.85 26 1838098 67272.24 275547088.59 27 1922422 70730.60 289712551.74 28 1991644 72033.30 295048414.43 29 2051437 73318.63 300313125.68 30 2125091 71738.08 293839181.94 31 2181269 68634.12 281125370.31 32 2250781 65660.90 268947027.05 33 2334020 68471.92 280460984.39 34 2389046 67522.39 276571699.69 elapsed: 35 ops: 2441407 ops/sec: 69155.75 bytes/sec: 283261938.87 ++++++++++ conclusion ++++++++++ ~~~ ==> Qcow2 image : a) controller node : elapsed: 28 ops: 2441407 ops/sec: 86784.91 bytes/sec: 355470991.53 b) 
compute node : elapsed: 17 ops: 2441407 ops/sec: 140094.59 bytes/sec: 573827429.96

==> raw image :
a) controller node : elapsed: 35 ops: 2441407 ops/sec: 69155.75 bytes/sec: 283261938.87
b) compute node : elapsed: 18 ops: 2441407 ops/sec: 130009.18 bytes/sec: 532517618.61
~~~

There is a large difference between the controller-node and compute-node numbers, but the relative difference between the plain image and the clone, as measured from the compute node, is still nowhere near twofold.
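For reference, the plain-vs-clone comparison above can be repeated with a short script (a rough sketch; the pool name `volumes` and the bench-* image names are placeholders, and the test images should be cleaned up afterwards):

~~~
#!/bin/bash
# Sketch: reproduce the plain-image vs. snapshot-backed-clone bench-write
# comparison from above. Pool and image names are placeholders; requires
# access to the Ceph admin keyring.
POOL=volumes

# Plain (flat) image
rbd create ${POOL}/bench-flat -s 204800 --image-format 2
rbd bench-write ${POOL}/bench-flat --io-size 4096 --io-threads 16 \
    --io-total 10000000000 --io-pattern seq

# Snapshot-backed clone (same layout Cinder creates from a raw Glance image)
rbd create ${POOL}/bench-parent -s 204800 --image-format 2
rbd snap create ${POOL}/bench-parent@snap
rbd snap protect ${POOL}/bench-parent@snap
rbd clone ${POOL}/bench-parent@snap ${POOL}/bench-clone
rbd bench-write ${POOL}/bench-clone --io-size 4096 --io-threads 16 \
    --io-total 10000000000 --io-pattern seq

# Cleanup
rbd rm ${POOL}/bench-clone
rbd rm ${POOL}/bench-flat
rbd snap unprotect ${POOL}/bench-parent@snap
rbd snap rm ${POOL}/bench-parent@snap
rbd rm ${POOL}/bench-parent
~~~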
Performed the test on a RHEL 7 setup: spawned two instances, one using the qcow2 image and the second using the raw image. The raw image created a parent/child relation.

~~~
# rbd -p images children 04aad515-1fb8-4f79-8838-71d38dabba1f@snap
vms/ca68fce0-76d2-4f85-b886-e4e5e02ccbff_disk
~~~

Running the dd test again, the instance spawned from the qcow2 image gives roughly twice the performance of the instance spawned from the raw image.
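Given the note in the original description that flattening removes the difference, the same check can be done here (a sketch, using the disk name from the output above; `rbd flatten` copies the parent's data into the child, so expect extra space usage and background I/O while it runs):

~~~
# Confirm the clone, flatten it, then re-run the dd test inside the guest.
rbd -p images children 04aad515-1fb8-4f79-8838-71d38dabba1f@snap
rbd flatten vms/ca68fce0-76d2-4f85-b886-e4e5e02ccbff_disk
rbd info vms/ca68fce0-76d2-4f85-b886-e4e5e02ccbff_disk    # the 'parent:' line should now be gone
~~~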
I had thought that Ceph with Nova requires the use of a raw image, correct? Because Ceph does the "backing image" in RBD that was formerly done using qcow2, correct? But evidently not so.

Is Ceph imposing some sort of copy-on-write overhead for the Nova image that it doesn't need to impose? For example, is it reading from the snapshot the 4-KB filesystem blocks that dd is writing to? It is necessary to read the backing image if you insert data into the middle of a block, but if you are writing out the entire block, in theory it should be unnecessary to read the block from the snapshot first, since it doesn't matter what was stored there before.

As the post suggests, we should be able to run this test against an RBD volume backed by a snapshot vs. an RBD volume not backed by a snapshot, using the librbd engine in fio, and see whether it has anything to do with the use of RBD snapshots. If you run the same test against a Cinder volume, which is not backed by a snapshot, what do you get?

Another interesting test would be to repeat the dd test on the exact same file, using "conv=notrunc", so you would be writing to the same physical blocks in storage. This would be a "re-write" test; there should be no copy-on-write overhead at that point because the Nova image has already diverged from the backing snapshot.

Note that if the copy-on-write hypothesis above is correct, the qcow2 image is much smaller than the raw image (i.e. sparse?), so there is less reading to do, which might explain the difference in performance.
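A rough sketch of that fio comparison, assuming fio was built with rbd engine support and is run from a node holding the client.admin keyring; the pool and the bench-flat/bench-clone image names are placeholders (created as a plain image and as a snapshot-backed clone respectively, as with the rbd commands elsewhere in this bug):

~~~
# 4k sequential writes through librbd against a flat image and a
# snapshot-backed clone; compare the reported write bandwidth of the two runs.
fio --name=flat-image   --ioengine=rbd --clientname=admin --pool=volumes \
    --rbdname=bench-flat  --rw=write --bs=4k --iodepth=16 --runtime=60 --time_based

fio --name=cloned-image --ioengine=rbd --clientname=admin --pool=volumes \
    --rbdname=bench-clone --rw=write --bs=4k --iodepth=16 --runtime=60 --time_based
~~~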
Strangely, I have not seen the twofold difference this time. Here are the test results from the OSP 7 setup.

Commands used :
# dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync
# dd if=/dev/zero of=file1 bs=1024k count=1024 conv=notrunc

-----------------------------------
conv      | qcow2 image | raw image
-----------------------------------
fdatasync | 104 MB/s    | 85.8 MB/s
notrunc   | 158 MB/s    | 136 MB/s
-----------------------------------
Thanks, can you try "conv=fdatasync,notrunc"? fdatasync is important because otherwise the data may not have reached persistent storage.
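For reference, re-running against the same file with both flags would be:

~~~
# notrunc keeps the existing file blocks (re-write), fdatasync makes dd flush
# to persistent storage before it reports a rate.
dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync,notrunc
~~~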
Hello, I am facing some issues with the test setup. I will be sure to update you once the setup is functional again.
Sorry for the delayed response; it's really difficult to get hold of a physical setup. This time I have used different hardware with an OSP 10 setup to reproduce the issue:

Step 1 : Spawned two instances, one using the qcow2 image and the other using the raw image.

~~~
[root@overcloud-controller-0 ~]# nova list
+--------------------------------------+-----------------+--------+------------+-------------+-----------------------+
| ID                                   | Name            | Status | Task State | Power State | Networks              |
+--------------------------------------+-----------------+--------+------------+-------------+-----------------------+
| 85814a47-2215-467e-a47f-63b191171c33 | qcow2-instance1 | ACTIVE | -          | Running     | internal1=10.10.10.11 |
| 2b40849a-a99e-492f-93c1-db4c4b2ff80e | raw-instance1   | ACTIVE | -          | Running     | internal1=10.10.10.10 |
+--------------------------------------+-----------------+--------+------------+-------------+-----------------------+
~~~

Step 2 : Verified that the disks are created on the Ceph backend.

~~~
[root@overcloud-controller-0 ~]# rbd -p images ls -l
NAME                                         SIZE PARENT FMT PROT LOCK
450293f9-8a49-4688-9667-85d2ee0a0fb8       10240M          2
450293f9-8a49-4688-9667-85d2ee0a0fb8@snap  10240M          2 yes
c1728a18-f914-4222-93a6-692ff252eb6f         539M          2
c1728a18-f914-4222-93a6-692ff252eb6f@snap    539M          2 yes

[root@overcloud-controller-0 ~]# rbd -p vms ls -l
NAME                                         SIZE PARENT                                            FMT PROT LOCK
2b40849a-a99e-492f-93c1-db4c4b2ff80e_disk  20480M images/450293f9-8a49-4688-9667-85d2ee0a0fb8@snap    2      excl
85814a47-2215-467e-a47f-63b191171c33_disk  20480M                                                     2
~~~

Step 3 : Running tests.

Instance created using the qcow2 image:

~~~
[root@host-10-10-10-11 ~]# time dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync,notrunc
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.6126 s, 101 MB/s

real 0m10.614s
user 0m0.000s
sys 0m0.532s
~~~

Instance created using the raw image:

~~~
[root@host-10-10-10-10 ~]# time dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync,notrunc
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 15.1281 s, 71.0 MB/s

real 0m15.130s
user 0m0.000s
sys 0m0.628s
~~~

The difference is still significant. I can run more tests if you want me to do that.
Vikrant, I'd like to see whether this behaves the same in our own configuration. Thanks for your help; it's on my to-do list. This is a significant performance difference that you are observing.

I didn't see the RHCS version you were using; is it in here? If not, could you provide that?

Also, did you run these tests more than once on the same volume, and was there any difference in the performance the 2nd-Nth times? With rbd create, AFAIK the volume is not actually allocated or initialized at creation time, and this causes performance measurements for the first write to differ from measurements for subsequent writes. That's why Tim and I dd to the entire cinder volume and treat this as a separate test from measuring its steady-state performance. Also, I didn't see you dropping caches in any of these tests, so this introduces variability between runs as well.

Yes, you would expect the raw image to be written to faster than the qcow2 image, since there is no backing image to account for. There may be some complex behaviors around whether the backing image is cached or not, what kind of write I/O pattern is being done, how qcow2 images differ from raw images, etc.

-ben
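For the repeat runs, something along these lines would do (a sketch, run inside the guest; dropping the guest page cache does not flush librbd or hypervisor caches):

~~~
# Repeat the same dd on the same file to separate first-write (copy-up /
# allocation) cost from steady-state re-write performance.
for i in 1 2 3 4 5; do
    sync
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/dev/zero of=file1 bs=1024k count=1024 conv=fdatasync,notrunc
done
~~~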
Ceph version installed with OSP 10:

~~~
puppet-ceph-2.2.1-3.el7ost.noarch
ceph-osd-10.2.2-41.el7cp.x86_64
ceph-common-10.2.2-41.el7cp.x86_64
python-cephfs-10.2.2-41.el7cp.x86_64
ceph-base-10.2.2-41.el7cp.x86_64
ceph-mon-10.2.2-41.el7cp.x86_64
ceph-selinux-10.2.2-41.el7cp.x86_64
libcephfs1-10.2.2-41.el7cp.x86_64
ceph-radosgw-10.2.2-41.el7cp.x86_64
~~~

The results in comment #9 were gathered while spawning the instances from images (not volumes). I have run the test only once.
Question: does this happen when a Ceph backend is not used? What happens when ephemeral or other storage is used? I'm trying to determine whether the behavior described in this bz has anything to do with Ceph, since that determines who needs to work on it.

Question 2: why use qcow2 if you have Ceph RBD functionality to do copy-on-write? I think the answer is that the qcow2 image is really small; in the initial post it is about 1/2 GB of physical space representing a 10-GB virtual image. This makes it much quicker to load and cache the entire Glance image, which can only help performance.

The key observation here is that when we flatten the Nova images (eliminate the backing image), the performance of the two becomes the same. I think this is consistent with the hypothesis in comment 4.
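If copy-up overhead is the cause, one way to watch how far a cloned volume has diverged from its parent is below (a sketch; `rbd du` needs a Jewel-or-newer client and can be slow without fast-diff, and the volume name is a placeholder):

~~~
# USED grows as copy-on-write copy-ups populate the child; after a flatten it
# should match the provisioned size and the parent line disappears.
rbd du volumes/volume-<uuid>
rbd info volumes/volume-<uuid> | grep parent
~~~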
qcow2 images are not supported in Ceph.
To be a little more specific, qcow2 images are not supported as Glance images, see http://docs.ceph.com/docs/master/rbd/rbd-openstack/ "Ceph doesn’t support QCOW2 for hosting a virtual machine disk. Thus if you want to boot virtual machines in Ceph (ephemeral backend or boot from volume), the Glance image format must be RAW." Josh Durgin and Jason Dillaman confirmed this.
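In practice that means converting a qcow2 source to raw before uploading it to Glance, for example (a sketch; file and image names are placeholders — Ceph then provides the copy-on-write layering itself via RBD cloning):

~~~
# Convert the qcow2 source to raw and upload it as a raw Glance image.
qemu-img convert -f qcow2 -O raw image.qcow2 image.raw
glance image-create --name my-raw-image --disk-format raw \
    --container-format bare --file image.raw
~~~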
I do agree that we do not support qcow2 when using the Ceph backend. The customer used it just to show the difference in performance between the two disk formats.