Bug 1461536
| Summary: | [TestOnly] RHEL 7.4 \| VM pause running IO after extending block (iSCSI/FCP) sparse qcow disk | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Avihai <aefrat> |
| Component: | Core | Assignee: | Nir Soffer <nsoffer> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Avihai <aefrat> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | --- | CC: | aefrat, amureini, bugs, coli, dyuan, eberman, jiyan, juzhang, lmen, nsoffer, pkrempa, ratamir, xuzhang, ykaul, ylavi |
| Target Milestone: | ovirt-4.1.4-1 | Keywords: | Regression, Reopened, TestOnly |
| Target Release: | 4.19.25 | Flags: | rule-engine: ovirt-4.1+, rule-engine: blocker+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-23 08:02:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1470127, 1473145 | | |
| Bug Blocks: | | | |
| Attachments: | | | |
Description
Avihai 2017-06-14 17:38:39 UTC
Eyal, please work with Avihai and let's ensure it is the same issue we are familiar with.

(In reply to Yaniv Kaul from comment #1)
> Eyal, please work with Avihai and let's ensure it is the same issue we are
> familiar with.

Link?

Guy Chen is looking at this issue & will respond shortly.

More info on my side:
Guy ran fio that writes 10 files of 1 GB each; I ran a single dd with a total size of 1328 MB and bs=1M.

https://bugzilla.redhat.com/show_bug.cgi?id=1461536
By the VDSM log timeline:
2017-06-11 11:46:44 Extend succeeded
2017-06-11 11:46:45 dd started
2017-06-11 11:47:21 VM paused

dd ran for ~45 sec, which, had the extend not been done, would have filled 1 GB and paused the VM.

Example of the dd done:
> dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'

(In reply to Allon Mureinik from comment #2)
> (In reply to Yaniv Kaul from comment #1)
> > Eyal, please work with Avihai and let's ensure it is the same issue we are
> > familiar with.
> Link?

https://bugzilla.redhat.com/show_bug.cgi?id=1408594

(In reply to Avihai from comment #3)
> Guy Chen is looking at this issue & will respond shortly.
> ......
> By the VDSM log timeline:
> 2017-06-11 11:46:44 Extend succeeded

Is this extend relevant?

> 2017-06-11 11:46:45 dd started
> 2017-06-11 11:47:21 VM paused
>
> dd ran for ~45 sec, which, had the extend not been done, would have filled
> 1 GB and paused the VM.

45 s for 1 GB is not very fast. However, what is interesting is when we fill >50%, and how long the whole extending process takes. You can find that out via the VDSM logs: the whole flow, through the mailbox, etc. 45 seconds is quite a lot, actually.
> > Example of the dd done:
> > dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'

This is not a very good 'dd' command. For example, it uses caching.

(In reply to Yaniv Kaul from comment #5)
> (In reply to Avihai from comment #3)
> > ......
> > By the VDSM log timeline:
> > 2017-06-11 11:46:44 Extend succeeded
>
> Is this extend relevant?

Yes, this is my initiated extend, done on the disk from 1G -> 2G; why is this not relevant? You can see in the bug's original description, in the VDSM log, that it finished successfully at 11:46:44.

> > 2017-06-11 11:46:45 dd started
> > 2017-06-11 11:47:21 VM paused
> >
> > dd ran for ~45 sec, which, had the extend not been done, would have filled
> > 1 GB and paused the VM.
>
> 45 s for 1 GB is not very fast. However, what is interesting is when we fill
> >50%, and how long the whole extending process takes. You can find that out
> via the VDSM logs: the whole flow, through the mailbox, etc. 45 seconds is
> quite a lot, actually.

Right before the VM pause I see the following mailbox-spm messages in the VDSM log, but these messages appear all over the log before the dd started; how do you know from them when the extend was done?
From the VDSM log:

2017-06-11 11:47:19,434+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-11 dd if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
2017-06-11 11:47:19,453+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00361567 s, 283 MB/s\n'; <rc> = 0 (commands:93)
2017-06-11 11:47:21,230+0300 INFO (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') abnormal vm stop device virtio-disk1 error enospc (vm:4166)
2017-06-11 11:47:21,231+0300 INFO (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onIOError (vm:5045)
2017-06-11 11:47:21,468+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-11 dd if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)
2017-06-11 11:47:21,508+0300 WARN (periodic/1) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on ['ca4c86cb-dba3-4feb-8d83-5f864b24c487'] (periodic:301)
2017-06-11 11:47:21,776+0300 INFO (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') No VM drives were extended (vm:4173)
2017-06-11 11:47:21,777+0300 INFO (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onSuspend (vm:5045)

> > Example of the dd done:
> > dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'
>
> This is not a very good 'dd' command. For example, it uses caching.
I know client caching is not disabled, as "oflag=direct" is not used in the dd here, but this is a functional test and we are not testing performance, so in the worst case caching should produce faster write IO here, and we should be able to handle it.

(In reply to Avihai from comment #6)
> Yes, this is my initiated extend, done on the disk from 1G -> 2G;
> why is this not relevant?

Because we'd like to understand how long the next extend took. Although you could compare it to this extend: how much time did it take?

> Right before the VM pause I see the following mailbox-spm messages in the
> VDSM log, but these messages appear all over the log before the dd started;
> how do you know from them when the extend was done?

Check the SPM logs - he's the one extending.
> From the VDSM log:
> 2017-06-11 11:47:19,434+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> /usr/bin/taskset --cpu-list 0-11 dd
> if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
> 2017-06-11 11:47:19,453+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB)
> copied, 0.00361567 s, 283 MB/s\n'; <rc> = 0 (commands:93)
> 2017-06-11 11:47:21,230+0300 INFO (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') abnormal vm stop device
> virtio-disk1 error enospc (vm:4166)

This is already too late, I reckon.

> 2017-06-11 11:47:21,231+0300 INFO (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onIOError (vm:5045)
> 2017-06-11 11:47:21,468+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> /usr/bin/taskset --cpu-list 0-11 dd
> if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
> 2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB)
> copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)
> 2017-06-11 11:47:21,508+0300 WARN (periodic/1) [virt.periodic.VmDispatcher]
> could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on
> ['ca4c86cb-dba3-4feb-8d83-5f864b24c487'] (periodic:301)
> 2017-06-11 11:47:21,776+0300 INFO (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') No VM drives were extended (vm:4173)

Not sure what this is...
> 2017-06-11 11:47:21,777+0300 INFO (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onSuspend (vm:5045)
>
> > > Example of the dd done:
> > > dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'
> >
> > This is not a very good 'dd' command. For example, it uses caching.
>
> I know client caching is not disabled, as "oflag=direct" is not used in the
> dd here, but this is a functional test and we are not testing performance,
> so in the worst case caching should produce faster write IO here, and we
> should be able to handle it.

I believe caching is (sometimes) bad because it does a burst of IO: it dumps all pending IO at once when syncing. If you do not use caching, it hopefully will not burst. We have less chance of dealing with a burst of IO, that's all.

I am a libvirt QE.

I tried to use the simplest scenario, without RHV or iSCSI, using only libvirt and an LV, and I can reproduce this issue.
Maybe this issue is related to the resize operation on the *LV* combined with error_policy='stop' (in RHV the disk also uses this value): the resize operation causes the IO error, and error_policy='stop' causes the guest to pause.
version:
libvirt-3.2.0-15.el7a.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64
Steps:
1. Prepare a 1G qcow2 LV:
[root@lmen1 ~]# lvcreate -L 1G -n lvv vgg
  Logical volume "lvv" created.
[root@lmen1 ~]# qemu-img convert -O qcow2 /dev/vgg/lvv /dev/vgg/lvv
2. Start a guest with:
<disk type='block' device='disk'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
  <source dev='/dev/vgg/lvv'/>
  <target dev='sdb' bus='scsi'/>
  <address type='drive' controller='0' bus='0' target='0' unit='1'/>
</disk>
[root@lmen1 ~]# virsh start test
Domain test started
In the guest, the 1G disk sdb is present.
3. Resize the disk:
[root@lmen1 ~]# virsh blockresize test --path /dev/vgg/lvv --size 2G
Block device '/dev/vgg/lvv' is resized
In the guest, the size of sdb is now 2G.
4. Run a dd command in the guest:
dd if=/dev/zero of=/dev/sdb bs=1M count=1358
5. Check the guest status on the host:
[root@lmen1 ~]# virsh list
 Id    Name    State
----------------------------------------------------
 33    test    paused

The log info is uploaded as attachment 1 [details].
PS:
I have tried other scenarios:
1. Change error_policy to 'report' in step 2, and repeat all the steps above.
Result: the guest is still in the running state, but inside the guest some IO errors are output; the error info is uploaded as the 2nd attachment.
So I think the paused state is caused by error_policy='stop'.
2. Keep error_policy as 'stop', replace the LV with '/var/lib/libvirt/images/lmen.qcow2', and repeat the steps above.
Result: dd succeeds, no error, and the guest is still running.
So I think the issue is related to the LV.
3. Keep error_policy as 'stop', create a *2G* qcow2 LV directly, and start the guest with this LV. Run "dd if=/dev/zero of=/dev/sdb bs=1M count=1358".
Result: dd succeeds, no error, and the guest is still running.
So I think the issue is related to the resize operation on the LV.
4. Repeat all the steps, but in step 4 run "dd if=/dev/zero of=/dev/sdb bs=1M count=900".
Result: dd succeeds, no error, and the guest is still running. But when count>=1000, the guest is paused.
I am not sure whether these steps are a valid testing scenario. Could anyone help check their validity?
If they are valid, could the qemu folks help reproduce this issue?
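The count threshold in scenario 4 lines up with the size of the original LV. A minimal sketch of the arithmetic, assuming the qcow2 payload is the only consumer of the LV (qcow2 metadata overhead is ignored here, which is presumably why the observed cutoff is near count=1000 rather than exactly 1024):

```shell
# Sketch: a 1 GiB LV holds at most ~1024 MiB of qcow2 payload until the
# backing device is actually grown; a write past that returns ENOSPC,
# which error_policy='stop' turns into a guest pause.
lv_mib=1024                   # size of the LV the qcow2 image sits on
for count in 900 1358; do     # the dd bs=1M counts from the scenarios
  if [ "$count" -le "$lv_mib" ]; then
    echo "count=$count fits: dd succeeds, guest keeps running"
  else
    echo "count=$count overflows: ENOSPC, guest pauses"
  fi
done
```

This matches the reported behavior: count=900 fits, count=1358 overflows.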
Created attachment 1290965 [details]
log in /var/log/libvirt/libvirtd.log from lmen
Created attachment 1290966 [details]
error info in the guest from lmen
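The page-cache concern raised earlier about the test's dd can be demonstrated side by side. This is an illustrative sketch: the target below is a scratch file, not one of the mount points from the bug, and O_DIRECT requires filesystem support:

```shell
# Buffered write, as in the reproducer's dd: pages accumulate in the
# page cache and reach the block device in bursts.
target=./dd_scratch_file      # placeholder scratch file
dd if=/dev/zero of="$target" bs=1M count=16 status=none
sync                          # the burst toward storage happens here

# Direct write: each 1 MiB block bypasses the page cache, so the device
# sees a steadier stream instead of a sync-time burst. Fall back
# gracefully where O_DIRECT is unsupported (e.g. tmpfs).
dd if=/dev/zero of="$target" bs=1M count=16 oflag=direct status=none \
  || echo "O_DIRECT not supported on this filesystem"
rm -f "$target"
```

The burst from the buffered variant is why a watermark-based extend mechanism has less time to react to it.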
(In reply to Avihai from comment #0)
......
> Steps to Reproduce:
> 1. Clone VM from template with OS from iscsi/FC storage domain.
> 2. Create additional thin disk from iscsi/FC with:
>    size = 1G
>    format = QCOW
>    sparse = True

Hi,
Could RHV devel/QE help let libvirt QE know which libvirt (or other component) APIs RHV calls when implementing the following disk-extend action? Then we (libvirt QE) can better judge whether the pure-libvirt reproduction steps in comment 10 are right or not.

> 3. Extend the created disk with an additional 1G (new total size 2G)
> 4. Create partition + FS (ext4) on new disk
> 5. Write IO via dd to the newly created FS (dd size = disk size - 600M ~= 1358M),
>    for example in the local run I saw:
> .......

Avihai, could you please help answer the libvirt QE questions in comment #13?

Hi Xuesong Zhang,

I'm not sure if this is what you mean, but here goes. Storage dev, please DO correct me if I'm wrong. There are 2 extend types, and 2 matching APIs used here:

1) User-initiated disk extend (disk size from 1G -> 2G).
API: vdsm.api -> function 'extendVolumeSize'. VDSM example:

2017-06-11 11:46:44,460+0300 INFO (jsonrpc/6) [vdsm.api] START extendVolumeSize(spUUID='d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd', sdUUID='df501232-7b90-4c2a-b0a6-6ed4570f3448', imgUUID='2a555c74-e36b-4345-9afd-65b181c80001', volUUID='277075e4-d02c-4f9c-8347-42c62515b012', newSize='2147483648') from=::ffff:10.35.69.94,45266, flow_id=disks_update_63c5069c-a77f-4656 (api:46)
2017-06-11 11:46:44,500+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='4da2373b-dd7c-4391-b007-ab58a16b46d9') Task.run: running job 0: extendVolumeSize: <bound method StoragePool.extendVolumeSize of <storage.sp.StoragePool object at 0x30eb210>> (args: ('df501232-7b90-4c2a-b0a6-6ed4570f3448', '2a555c74-e36b-4345-9afd-65b181c80001', '277075e4-d02c-4f9c-8347-42c62515b012', 4194304) kwargs: {}) (task:909)

2) As with thin-provisioned storage, a thin-provisioned disk grows (extends) in fixed-size extents, so storage only consumes the amount of disk space actually used as the user writes IO into more extents. As I understood from Yaniv K. (comment 5):
API: mailbox-spm -> storage.Misc.excCmd. Example:

2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)

As you can see here, extent size = 1MB.

No matter which method is used to extend the size of the block device (the vdsm API or the virsh blockresize command), the results are the same: 'qemu-img info' lists the right size (2G), and the size in the guest is also right.

So we think the main problem occurs in the following operation: writing IO (dd) to the disk. We don't need to focus too much on the extending methods.

Maybe it is a qemu issue, so @juzhang, could you please find a suitable qemu person to help reproduce the issue?

(In reply to lijuan men from comment #16)
> ......
> Maybe it is a qemu issue, so @juzhang, could you please find a suitable qemu
> person to help reproduce the issue?

Hi Coli,

Could you handle this?

Best Regards,
Junyi

Adding my latest discussion with Yaniv K. It looks like the write IO rate on RHEL 7.4 is higher than on RHEL 7.3, which causes the extend not to be triggered fast/early enough, which triggers this bug.

Discussion highlights:
"
Me:
- I checked again with the current dd on RHEL 7.3, and the issue does not reproduce.
- Checking on RHEL 7.4, the issue reproduces right away with the same build and the current dd.
- Slowing down the dd (with if=/dev/urandom & oflag=direct) did solve the issue, but this is not the solution, just an indication of the problem.

Original dd performed:
'/bin/dd bs=1M count=1358 if=/dev/vda of=/mount-point_2617184396/written_test_storage status=none'
Slow dd performed:
'/bin/dd bs=64k count=1358 if=/dev/urandom of=/mount-point_2618294159/written_test_storage status=none oflag=direct'

It's the exact same host, so no change in the host itself other than the upgrade to 7.4, which I did after checking the issue on 7.3. However, I did not measure IOPS (for some reason kill -USR1 <dd_PID> did not work). Anyway, it looks like 7.4 IOPS are higher, which causes this issue to appear.

Yaniv K.: Exactly. And higher IOPS are a good thing. The question is whether we should change our default 1G extend to a larger value.
"

(In reply to Avihai from comment #15)
> 2) As with thin-provisioned storage, a thin-provisioned disk grows (extends)
> in fixed-size extents, so storage only consumes the amount of disk space
> actually used as the user writes IO into more extents.

Hi Avihai,

Could you please provide the information about the thin provisioned disk?

cat /sys/block/sd*/queue/discard_max_bytes
cat /sys/devices/***/provisioning_mode

Thanks.

(In reply to CongLi from comment #19)
> Could you please provide the information about the thin provisioned disk?
>
> cat /sys/block/sd*/queue/discard_max_bytes

Hi Cong,

Do you mean on the VM itself or on the host? If you mean on the VM, then unfortunately the VM is paused, so I cannot connect to it and extract this info.
If you meant on the host (which the VM resides on), then this is what I get:

cat /sys/block/sd*/queue/discard_max_bytes
1073741824
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608

> cat /sys/devices/***/provisioning_mode

This did not work:
[root@storage-ge1-vdsm3 log]# cat /sys/devices/***/provisioning_mode
cat: /sys/devices/***/provisioning_mode: No such file or directory

But with the find utility I located the provisioning_mode files:
[root@storage-ge1-vdsm3 log]# find /sys/devices/ -name provisioning_mode
/sys/devices/pci0000:00/0000:00:04.0/virtio1/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:1/scsi_disk/3:0:0:1/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:2/scsi_disk/3:0:0:2/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:3/scsi_disk/3:0:0:3/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:4/scsi_disk/3:0:0:4/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:5/scsi_disk/3:0:0:5/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:6/scsi_disk/3:0:0:6/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:1/scsi_disk/4:0:0:1/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:2/scsi_disk/4:0:0:2/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:3/scsi_disk/4:0:0:3/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:4/scsi_disk/4:0:0:4/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:5/scsi_disk/4:0:0:5/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:6/scsi_disk/4:0:0:6/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:1/scsi_disk/5:0:0:1/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:2/scsi_disk/5:0:0:2/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:3/scsi_disk/5:0:0:3/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:4/scsi_disk/5:0:0:4/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:5/scsi_disk/5:0:0:5/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:6/scsi_disk/5:0:0:6/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:1/scsi_disk/6:0:0:1/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:2/scsi_disk/6:0:0:2/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:3/scsi_disk/6:0:0:3/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:4/scsi_disk/6:0:0:4/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:5/scsi_disk/6:0:0:5/provisioning_mode

Cat of all these files produces:
[root@storage-ge1-vdsm3 log]# cat /sys/devices/platform/host*/session*/target*:0:0/*:0:0:*/scsi_disk/*:0:0:*/provisioning_mode
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
[root@storage-ge1-vdsm3 log]# cat /sys/devices/pci0000:00/0000:00:04.0/virtio1/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode
unmap

> Thanks.

(In reply to Avihai from comment #20)
> Hi Cong,
>
> Do you mean on the VM itself or on the host ?

I mean the host.

> If you mean on the VM, then unfortunately the VM is paused, so I cannot
> connect to it & extract this info.
> > If you meant on the host (which the VM resides on), then this is what I get:
> >
> > cat /sys/block/sd*/queue/discard_max_bytes
> > 1073741824
> > 8388608
> > ... (24 values of 8388608 in total)
> >
> > > cat /sys/devices/***/provisioning_mode
> >
> > This did not work:
> > [root@storage-ge1-vdsm3 log]# cat /sys/devices/***/provisioning_mode
> > cat: /sys/devices/***/provisioning_mode: No such file or directory

Thanks Avihai.

I'm sorry to confuse you; I used my own shorthand to describe it. Actually, '***' stands for your device path. Like:

cat /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host11/rport-11:0-2/target11:0:0/11:0:0:1/scsi_disk/11:0:0:1/provisioning_mode
writesame_16

Could you please help confirm whether this bug can be reproduced on a thick (non-thin-provisioned) disk? I'm just wondering whether it's a thin-provisioned-disk problem.
FYI: https://bugzilla.redhat.com/show_bug.cgi?id=1462594

Thanks.

(In reply to CongLi from comment #21)
> (In reply to Avihai from comment #20)
> > Hi Cong,
> >
> > Do you mean on the VM itself or on the host ?
>
> I mean the host.
>
> > If you mean on the VM, then unfortunately the VM is paused, so I cannot
> > connect to it & extract this info.
> > If you meant on the host (which the VM resides on), then this is what I get:
> >
> > cat /sys/block/sd*/queue/discard_max_bytes
> > 1073741824
> > ... (24 values of 8388608 in total)
> >
> > > cat /sys/devices/***/provisioning_mode
> >
> > This did not work:
> > [root@storage-ge1-vdsm3 log]# cat /sys/devices/***/provisioning_mode
> > cat: /sys/devices/***/provisioning_mode: No such file or directory
>
> Thanks Avihai.
>
> I'm sorry to confuse you; I used my own shorthand to describe it.
> Actually, '***' stands for your device path. Like:
> cat /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host11/rport-11:0-2/target11:0:0/11:0:0:1/scsi_disk/11:0:0:1/provisioning_mode
> writesame_16
>
> Could you please help confirm whether this bug can be reproduced on a thick
> (non-thin-provisioned) disk?

This bug only occurs on a thin-provisioned disk. I tried to reproduce it on a RAW preallocated disk, and the issue did not reproduce.

> I'm just wondering whether it's a thin-provisioned-disk problem.
> FYI: https://bugzilla.redhat.com/show_bug.cgi?id=1462594
>
> Thanks.

Please tell me if you need any more info.

(In reply to Avihai from comment #22)
> > Could you please help confirm whether this bug can be reproduced on a thick
> > (non-thin-provisioned) disk?
>
> This bug only occurs on a thin-provisioned disk.
> I tried to reproduce it on a RAW preallocated disk, and the issue did not reproduce.
>
> > I'm just wondering whether it's a thin-provisioned-disk problem.
> > FYI: https://bugzilla.redhat.com/show_bug.cgi?id=1462594

Hi Avihai,

Confirmed with storage QE; he thinks it's similar to BZ1462594, but even the developers are still investigating what the problem in the kernel is.

I think you could ask the kernel storage team for more investigation.

Thanks.

(In reply to CongLi from comment #23)
> Confirmed with storage QE; he thinks it's similar to BZ1462594, but even the
> developers are still investigating what the problem in the kernel is.
>
> I think you could ask the kernel storage team for more investigation.

Hi Cong,

Can you please add a NeedINFO request for the kernel storage team focal point? Or, if you can tell me who it is, I'll do it.

Thanks.

Is your VM running with virtio-scsi and discard enabled? And what are you discarding exactly?

(In reply to Yaniv Kaul from comment #25)
> Is your VM running with virtio-scsi?

No, the disk uses the virtio interface.

> and discard enabled?

I'm not sure what you mean, but if you are talking about discard after delete of the storage domain, then it is disabled.

> and what are you discarding exactly?

I'm not aware of intentionally wanting to discard anything during the test.

(In reply to Avihai from comment #26)
> > Is your VM running with virtio-scsi?
> No, the disk uses the virtio interface.
>
> > and discard enabled?
> I'm not sure what you mean, but if you are talking about discard after
> delete of the storage domain, then it is disabled.
>
> > and what are you discarding exactly?
> I'm not aware of intentionally wanting to discard anything during the test.
So the bug mentioned above (https://bugzilla.redhat.com/show_bug.cgi?id=1462594) is completely irrelevant to this one. And I'm still convinced it's the same as our known issue, which you can easily prove by changing the allocation size, for example.

(In reply to Yaniv Kaul from comment #27)
> So the bug mentioned above (https://bugzilla.redhat.com/show_bug.cgi?id=1462594)
> is completely irrelevant to this one.
> And I'm still convinced it's the same as our known issue, which you can
> easily prove by changing the allocation size, for example.

Didn't we prove this already with the slow dd test? Why do we need another test to prove that? If the slow dd test is enough, please reassign accordingly.

If this test is really necessary, then please clarify what you mean by changing the allocation size, with a clear scenario, and I'll retest, hoping to provide more clarity on this issue.

(In reply to Avihai from comment #28)
> ......
> If this test is really necessary, then please clarify what you mean by
> changing the allocation size, with a clear scenario, and I'll retest.

*** This bug has been marked as a duplicate of bug 1408594 ***

Comment from Avihai:

Hi All,

It looks like it is indeed RHEL 7.4 running IO faster that is the root cause here. The following test was done in the same ovirt-guest-agent on latest 4.2 with RHEL 7.4 hosts to confirm it.

Test details:
Engine: 4.2.0-0.0.master.20170627181935.git9424f9b
VDSM: 4.20.1-89

1) We added the following 2 parameters to /etc/vdsm/vdsm.conf, to start the extend at 25% (instead of 50%) and to extend by 2G (instead of 1G), and restarted VDSM:

[vars]
vm_watermark_interval = 2

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

2) Ran the same scenario as in the bug, and the issue did not reproduce; this was done twice.
3) When we returned to the default parameters, the issue reproduced.

Copy-and-paste bug; please ignore the previous bug marks as blocking this one.

According to bug 1467826, thin disks are *never* extended on RHEL 7.4. I believe what we see in this bug is:
1. Create a 1G snapshot.
2. Try to write 1.3G of data to the disk.
3. The VM pauses.

When you change the oVirt defaults, using

[irs]
volume_utilization_chunk_mb = 2048

we create a 2G disk, so copying 1.3G of data does not pause the VM.

I don't see any data yet that supports the claim of a higher IO rate on 7.4. When bug 1467826 is fixed, we should test this again.

Wrong window, my apologies.
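The defaults-versus-tuned behavior discussed above is a race between the guest's write rate and the extend round trip. A back-of-the-envelope sketch follows; the write rate and extend latency below are illustrative assumptions, not measurements from this bug, and VDSM's real watermark logic is more involved than this:

```shell
# Race sketch: does the extend land before the writer exhausts the
# thin volume's remaining free space?
chunk_mb=1024        # volume_utilization_chunk_mb default
threshold_pct=50     # volume_utilization_percent default
free_mb=1024         # free space left in the thin volume when dd starts
write_mbps=100       # assumed guest write rate (faster on RHEL 7.4)
extend_s=10          # assumed mailbox + SPM extend round trip, seconds

# An extend request fires once free space drops below threshold% of a chunk.
trigger_free_mb=$(( chunk_mb * threshold_pct / 100 ))
t_trigger=$(( (free_mb - trigger_free_mb) / write_mbps ))  # secs until request
t_full=$(( free_mb / write_mbps ))                         # secs until ENOSPC

if [ $(( t_trigger + extend_s )) -lt "$t_full" ]; then
  echo "extend completes first: VM keeps running"
else
  echo "writer fills the volume first: VM pauses (ENOSPC)"
fi
```

With these numbers the writer wins and the VM pauses. Lowering volume_utilization_percent fires the request earlier, and a larger volume_utilization_chunk_mb buys more runway per extend, which is consistent with the tuned configuration not reproducing the pause.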
This is obviously critical for 4.1.4

Raz - based on the verification of the underlying issues, bug 1470127 and bug 1473145 - can this be set to VERIFIED?

The bag is still here.

rhevm-4.1.5-0.1.el7.noarch
vdsm-4.19.25-1.el7ev.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64

engine.log:

2017-07-30 13:49:06,513+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED(1,025), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm_TestCase5063_REST_ISCSI_3013412467 has been paused.
2017-07-30 13:49:06,630+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED_ENOSPC(138), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm_TestCase5063_REST_ISCSI_3013412467 has been paused due to no Storage space error.

vdsm.log:

2017-07-30 13:49:06,326+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,328+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onSuspend (vm:5048)
2017-07-30 13:49:06,328+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,329+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,337+0300 DEBUG (check/loop) [storage.asyncevent] Process <cpopen.CPopen object at 0x2f70c50> terminated (count=1) (asyncevent:565)
2017-07-30 13:49:06,338+0300 DEBUG (check/loop) [storage.check] FINISH check u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge5__nfs__1/ce1a3bf8-a91a-4f9e-afaf-a4a393e1c67e/dom_md/metadata' (rc=0, elapsed=0.05) (check:328)

* Logs attached

Created attachment 1306556 [details]
new_logs
(In reply to Raz Tamir from comment #35) > The bag is still here. The *bug* is still here. > > rhevm-4.1.5-0.1.el7.noarch > vdsm-4.19.25-1.el7ev.x86_64 > qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64 > > engine.log: > > 2017-07-30 13:49:06,513+03 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED(1,025), Correlation ID: > null, Call Stack: null, Custom Event > ID: -1, Message: VM vm_TestCase5063_REST_ISCSI_3013412467 has been paused. > 2017-07-30 13:49:06,630+03 ERROR > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED_ENOSPC(138), Correlation > ID: null, Call Stack: null, Custom Event ID: -1, Message: VM > vm_TestCase5063_REST_ISCSI_3013412467 has been paused due to no Storage > space error. > > > vdsm.log: > > 2017-07-30 13:49:06,326+0300 INFO (libvirt/events) [virt.vm] > (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended > (vm:4174) > 2017-07-30 13:49:06,328+0300 INFO (libvirt/events) [virt.vm] > (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onSuspend > (vm:5048) > 2017-07-30 13:49:06,328+0300 INFO (libvirt/events) [virt.vm] > (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device > virtio-disk1 error enospc (vm:4167) > 2017-07-30 13:49:06,329+0300 INFO (libvirt/events) [virt.vm] > (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError > (vm:5048) > 2017-07-30 13:49:06,337+0300 DEBUG (check/loop) [storage.asyncevent] Process > <cpopen.CPopen object at 0x2f70c50> terminated (count=1) (asyncevent:565) > 2017-07-30 13:49:06,338+0300 DEBUG (check/loop) [storage.check] FINISH check > u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com: > _Storage__NFS_storage__local__ge5__nfs__1/ce1a3bf8-a91a-4f9e-afaf- > a4a393e1c67e/dom_md/metadata' (rc=0, elapsed=0.05) (check:328) > > > * Logs attached (In reply to Raz Tamir from comment #35) > rhevm-4.1.5-0.1.el7.noarch > 
vdsm-4.19.25-1.el7ev.x86_64
> qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64

Which libvirt version did you test? See bug 1467826.

I'm using libvirt-3.2.0-14.el7.x86_64, which is the same as bug #1470127 was verified with.

(In reply to Raz Tamir from comment #39)
> I'm using libvirt-3.2.0-14.el7.x86_64 which is the same as bug #1470127
> verified with.

Is this a copy-paste bug (i.e., are you missing the last part)? Since that bug was verified with libvirt-3.2.0-14.el7_4.2

No, this is the version on my host. We will wait for verification with the correct libvirt that should land soon. Keep this bug on NEW for now.

To test this bug, we need to compare the behavior with 7.3 and 7.4. Pausing during the extend flow is expected; we want to know if on 7.4 we have more pauses compared with 7.3. Also, testing should be done with master or with the attached patch, improving logging for the extend flow:
https://gerrit.ovirt.org/79969

Keeping the needinfo until we have test results from both 7.3 and 7.4.

Avihay, this test contains a thin disk extend before writing io to disk. Do you have another test that writes to disk without extending it? For example:
1. Create thin disk of 2g
2. Attach disk to vm
3. Create file system and mount the disk
4. Copy 1.3g of data into new disk

Does this fail, or does it fail only when we extend the disk before the test?

Looking at the logs from attachment 1306556 [details]:
1. We don't have log for the extend operation, only for the completion:
2017-07-30 13:43:35,728+0300 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.diskSizeExtend succeeded in 0.15 seconds (__init__:539)
(I don't have any information when the disk copy was started, please include this
information in the next test)
2. The vm paused
2017-07-30 13:49:06,226+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,227+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
3. We checked the drives, and no drive needs extension:
2017-07-30 13:49:06,326+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
For some reason libvirt sends more events about this...
2017-07-30 13:49:06,328+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onSuspend (vm:5048)
2017-07-30 13:49:06,328+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,329+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,343+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,344+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,345+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,357+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,358+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,358+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,370+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,371+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,372+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,384+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,385+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,386+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,398+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,399+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,399+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,411+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,412+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,413+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,425+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,426+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,426+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,438+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,439+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,440+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,452+0300 INFO (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
This means that we did not get a correct allocation from libvirt.
Need to check if this happens only after resizing an online disk, since
extension works in other flows, e.g. provisioning a vm.
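Whether the host ever got a usable allocation value can be checked mechanically from a vdsm DEBUG log. A small sketch (the regex is keyed to the 'Extension info' format quoted in this thread; `stuck_drives` is an illustrative helper, not a vdsm API):

```python
import re

# Matches the periodic DEBUG samples vdsm logs for each drive, e.g.
# "Extension info for drive vdb volume ... (capacity=..., allocated=0, physical=...)"
SAMPLE = re.compile(r"Extension info for drive (?P<drive>\S+) .*?"
                    r"allocated=(?P<allocated>\d+)")

def stuck_drives(log_lines):
    """Drives for which every sample reported allocated=0, i.e. libvirt
    never handed vdsm a usable allocation value."""
    seen, nonzero = set(), set()
    for line in log_lines:
        m = SAMPLE.search(line)
        if m:
            seen.add(m.group("drive"))
            if int(m.group("allocated")) > 0:
                nonzero.add(m.group("drive"))
    return seen - nonzero

# Trimmed sample lines modeled on the logs in this report.
log = [
    "... Extension info for drive vdb volume x (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)",
    "... Extension info for drive vdb volume x (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)",
    "... Extension info for drive sdc volume y (capacity=2147483648, allocated=364380160, physical=1073741824) (vm:1073)",
]
print(stuck_drives(log))  # {'vdb'}
```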
I tried to reproduce with:

libvirt-*-3.2.0-15.el7_rc.9a49cd47ef.x86_64 (scratch build for bug 1467826)
qemu-*-2.9.0-16.el7_4.2.skip_implicit.x86_64 (scratch build for bug 1470634)

I could not reproduce the issue, and I see correct allocation values from libvirt, and correct operation in vdsm. Tested with https://gerrit.ovirt.org/79015, rebased on top of https://gerrit.ovirt.org/79974 for improved logging.

1. Starting io inside the vm; I logged to the vdsm log to make it easy to understand the flow

# echo "`date +'%Y-%m-%d %H:%M:%S,%N'` starting io inside the vm to /dev/sdb1" >> /var/log/vdsm/vdsm.log

2017-07-30 20:50:01,945536470 starting io inside the vm to /dev/sdb1

2. We get correct allocation from libvirt

2017-07-30 20:50:04,861+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=174194688, physical=1073741824) (vm:1073)
2017-07-30 20:50:06,862+0300 DEBUG (periodic/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=364380160, physical=1073741824) (vm:1073)

3. We detect that the disk needs extension

2017-07-30 20:50:08,862+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=578945024, physical=1073741824) (vm:1073)
2017-07-30 20:50:08,863+0300 INFO (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting extension for volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c on domain 6ffbc483-0031-403a-819b-3bb2f0f8de0a (apparent: 1073741824, capacity: 2147483648, allocated: 578945024, physical: 1073741824) (vm:1135)
2017-07-30 20:50:08,863+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting an extension for the volume: {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdc', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518770.25, 'internal': False, 'volumeID': u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c', 'imageID': u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee'} (vm:1212)
2017-07-30 20:50:08,863+0300 INFO (periodic/0) [vdsm.api] START sendExtendMsg(spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', volDict={'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdc', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518770.25, 'internal': False, 'volumeID': u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c', 'imageID': u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee'}, newSize=2147483648, callbackFunc=<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>) from=internal, task_id=66f7b4d1-0c64-4b1a-8ca9-68220c468c0c (api:46)
2017-07-30 20:50:08,863+0300 INFO (periodic/0) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=66f7b4d1-0c64-4b1a-8ca9-68220c468c0c (api:52)
2017-07-30 20:50:08,876+0300 INFO (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/6c99f4e5-8588-46f5-a818-e11151c1d19c/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=1'] (mailbox:387)

4. Vm disk continues to become full...

2017-07-30 20:50:10,868+0300 DEBUG (periodic/3) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=732037120, physical=1073741824) (vm:1073)

5. Getting extend completed message from spm; the logical volume was extended on the SPM but this host does not see the new size yet...

2017-07-30 20:50:10,921+0300 INFO (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 198a9ea5-e9f5-43e7-8cb5-5a8ce04a1c92 (cmd=<function runTask at 0x276f758>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>, {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdc', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518770.25, 'internal': False, 'volumeID': u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c', 'imageID': u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee'})) (threadPool:208)

6. Refreshing the logical volume

2017-07-30 20:50:10,923+0300 INFO (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID=u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', imgUUID=u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee', volUUID=u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c') from=internal, task_id=b6a4b66a-3363-408a-b1db-68fe8c4d5f0b (api:46)

7. Refresh started; the vm does not see the new size yet (physical=1073741824)

2017-07-30 20:50:12,861+0300 DEBUG (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=732037120, physical=1073741824) (vm:1073)

8. The refresh did not complete yet, but the vm sees the new size (physical=2147483648)

2017-07-30 20:50:14,859+0300 DEBUG (periodic/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=747765760, physical=2147483648) (vm:1073)

9. Refresh volume completed

2017-07-30 20:50:15,141+0300 INFO (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=b6a4b66a-3363-408a-b1db-68fe8c4d5f0b (api:52)
2017-07-30 20:50:15,142+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extend volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c completed in 6.28 seconds (vm:1175)

Here is another run where the extend took 3.3 seconds:

1. Detecting that the drive needs extend

2017-07-30 20:38:34,748+0300 DEBUG (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive vdb volume 77406f37-729a-420c-85f9-0b3da825236b (capacity=2147483648, allocated=585236480, physical=1073741824) (vm:1073)

2. Sending extend message

2017-07-30 20:38:34,748+0300 INFO (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting extension for volume 77406f37-729a-420c-85f9-0b3da825236b on domain 6ffbc483-0031-403a-819b-3bb2f0f8de0a (apparent: 1073741824, capacity: 2147483648, allocated: 585236480, physical: 1073741824) (vm:1135)

3. Allocation continues to grow...

2017-07-30 20:38:36,753+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive vdb volume 77406f37-729a-420c-85f9-0b3da825236b (capacity=2147483648, allocated=703725568, physical=1073741824) (vm:1073)

4. Got extend completed message from the spm

2017-07-30 20:38:37,022+0300 INFO (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 399f1dda-5c32-4ab7-ab1e-73eb1ad5e283 (cmd=<function runTask at 0x276f758>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>, {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'vdb', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518076.13, 'internal': False, 'volumeID': u'77406f37-729a-420c-85f9-0b3da825236b', 'imageID': u'861f1134-1b98-4d93-9f2c-af43ed8b33a1'})) (threadPool:208)

5. Refreshing the drive...

2017-07-30 20:38:37,024+0300 INFO (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID=u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', imgUUID=u'86
2017-07-30 20:38:37,025+0300 INFO (mailbox-hsm/2) [storage.LVM] Refreshing LVs (vg=6ffbc483-0031-403a-819b-3bb2f0f8de0a, lvs=(u'77406f37-729a-420c-85f9-0b3da825236b',)) (lvm:1322)
2017-07-30 20:38:38,098+0300 INFO (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=e03becf8-9957-43a2-ba4d-35e5f100658e (api:52)

6. Extend completed

2017-07-30 20:38:38,099+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extend volume 77406f37-729a-420c-85f9-0b3da825236b completed in 3.36 seconds (vm:1175)

And another run where the extend took 9.8 seconds:

1. Detecting that the drive needs extend

2017-07-30 20:17:02,617+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=584187904, physical=1073741824) (vm:1073)

2. Requesting extension

2017-07-30 20:17:02,617+0300 INFO (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting extension for volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 on domain 6ffbc483-0031-403a-819b-3bb2f0f8de0a (apparent: 1073741824, capacity: 2147483648, allocated: 584187904, physical: 1073741824) (vm:1135)

3. Allocation continues to grow...

2017-07-30 20:17:04,616+0300 DEBUG (periodic/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=784596992, physical=1073741824) (vm:1073)

4. Got extend completed message from spm

2017-07-30 20:17:04,679+0300 INFO (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 2e16b2cf-f590-414e-a9d5-720e4bb496e6 (cmd=<function runTask at 0x276f758>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>, {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdb', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6516784.0, 'internal': False, 'volumeID': u'1befe02c-d021-47e8-b4e7-62dcdd6503d6', 'imageID': u'508c3cde-4178-434d-b9ba-22b2053bfd7e'})) (threadPool:208)

5. Refreshing logical volume

2017-07-30 20:17:04,680+0300 INFO (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID=u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', imgUUID=u'508c3cde-4178-434d-b9ba-22b2053bfd7e', volUUID=u'1befe02c-d021-47e8-b4e7-62dcdd6503d6') from=internal, task_id=de92fe0e-a5c7-4a10-9272-fc4495468e6d (api:46)

6. Vm does not see the new size yet...

2017-07-30 20:17:08,621+0300 DEBUG (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=900296704, physical=1073741824) (vm:1073)

7. Vm sees the new size

2017-07-30 20:17:10,617+0300 DEBUG (periodic/3) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=1029173248, physical=2147483648) (vm:1073)

8. Refresh completed

2017-07-30 20:17:12,441+0300 INFO (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=de92fe0e-a5c7-4a10-9272-fc4495468e6d (api:52)
2017-07-30 20:17:12,442+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extend volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 completed in 9.83 seconds (vm:1175)

The longest part of the extend is the lvm refresh operation:
- extend: 2 seconds
- refresh: 7.8 seconds (vm sees new size: 6 seconds)

We need better logging showing how much time each part of the flow took, so we can collect more data.

I tested with my poor storage, giving up to 70 MiB/s write rate inside the vm. It is possible that using faster storage the vm would pause while copying, but this is not news.

Avihay, please test again with master, using this patch:
https://gerrit.ovirt.org/79015

Please set the "virt" logger log level to DEBUG:

vdsm-client Host setLogLevel name=virt level=DEBUG

You will have a log with extension info every 2 seconds for each drive:

2017-07-30 22:49:53,664+0300 DEBUG (periodic/0) [virt.vm] (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extension info for drive sdb volume 5aa55005-2922-41ed-8f2a-3072d0d86f02 (capacity=2147483648, allocated=641925120, physical=1073741824) (vm:1073)

When extending a volume completes, we will have this log:

2017-07-30 22:49:55,609+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extend volume 5aa55005-2922-41ed-8f2a-3072d0d86f02 completed in 3.95 seconds (extend=2.12 seconds, refresh=1.83 seconds) (vm:1187)

It would be helpful to run only this test in automation at least 10 times with 7.3 and 7.4.
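For the 10-run comparison asked for above, the per-extend timings can be scraped from the completion line. A sketch (the regex is keyed to the "completed in N seconds (extend=..., refresh=...)" format introduced by the logging patch; `extend_stats` is an illustrative helper, not part of vdsm):

```python
import re
from statistics import mean

# Matches the completion line added by the logging patch, e.g.
# "Extend volume <uuid> completed in 3.95 seconds (extend=2.12 seconds, refresh=1.83 seconds)"
DONE = re.compile(r"Extend volume \S+ completed in (?P<total>[\d.]+) seconds "
                  r"\(extend=(?P<extend>[\d.]+) seconds, "
                  r"refresh=(?P<refresh>[\d.]+) seconds\)")

def extend_stats(log_lines):
    """Summarize extend durations so runs on 7.3 and 7.4 hosts can be
    compared side by side."""
    totals, refreshes = [], []
    for line in log_lines:
        m = DONE.search(line)
        if m:
            totals.append(float(m.group("total")))
            refreshes.append(float(m.group("refresh")))
    return {"samples": len(totals),
            "avg_total": mean(totals),
            "worst_total": max(totals),
            "avg_refresh": mean(refreshes)}

# Trimmed sample lines modeled on the logs in this thread.
log = [
    "... Extend volume a completed in 3.95 seconds (extend=2.12 seconds, refresh=1.83 seconds) (vm:1187)",
    "... Extend volume b completed in 9.83 seconds (extend=2.00 seconds, refresh=7.83 seconds) (vm:1187)",
]
print(extend_stats(log))
```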
(In reply to Nir Soffer from comment #43)
> Avihay, this test contains a thin disk extend before writing io to disk.

No, we do the extend BEFORE writing to disk.

Steps to Reproduce:
1. Clone VM from template with OS from iscsi/FC storage domain.
2. Create additional thin disk from iscsi/FC with:
   size = 1G
   format = QCOW
   sparse = True
3. Extend the created disk with an additional 1G (new total size 2G)
4. Create partition + FS (ext4) on new disk
5. Write IO via dd to the newly created FS (dd size = disk size - 600M ~= 1358M), for example in the local run I saw:

> Do you have another test that writes to disk without extending it?

No, but if it helps I can modify the current test & check it out.

> For example:
>
> 1. Create thin disk of 2g
> 2. Attach disk to vm
> 3. Create file system and mount the disk
> 4. Copy 1.3g of data into new disk
>
> Does this fail, or only when we extend the disk before the test it fails?

Will test & update.

(In reply to Avihai from comment #47)
> (In reply to Nir Soffer from comment #43)
> > Avihay, this test contains a thin disk extend before writing io to disk.
> No, we do the extend BEFORE writing to disk.
>
> Steps to Reproduce:
> 1. Clone VM from template with OS from iscsi/FC storage domain.
> 2. Create additional thin disk from iscsi/FC with:
>    size = 1G
>    format = QCOW
>    sparse = True
> 3. Extend the created disk with an additional 1G (new total size 2G)
> 4. Create partition + FS (ext4) on new disk
> 5. Write IO via dd to the newly created FS (dd size = disk size - 600M ~=
> 1358M), for example in the local run I saw:
>
> > Do you have another test that writes to disk without extending it?
> No, but if it helps I can modify the current test & check it out.
>
> > For example:
> >
> > 1. Create thin disk of 2g
> > 2. Attach disk to vm
> > 3. Create file system and mount the disk
> > 4. Copy 1.3g of data into new disk
> >
> > Does this fail, or only when we extend the disk before the test it fails?
> Will test & update.

The issue (VM pause) still occurs without the extend, in your suggested scenario.

Engine:

2017-07-31 08:48:46,897+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM vm_TestCase5063_REST_ISCSI_3108445080 has been paused due to no Storage space error.

VDSM:

2017-07-31 08:48:46,644+0300 INFO (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') abnormal vm stop device virtio-disk1 error enospc (vm:4501)
2017-07-31 08:48:46,645+0300 INFO (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') CPU stopped: onIOError (vm:5431)
2017-07-31 08:48:46,655+0300 INFO (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') No VM drives were extended (vm:4508)
2017-07-31 08:48:46,656+0300 INFO (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') CPU stopped: onSuspend (vm:5431)
2017-07-31 08:48:46,657+0300 INFO (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') abnormal vm stop device virtio-disk1 error eother (vm:4501)
2017-07-31 08:48:46,657+0300 INFO (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') CPU stopped: onIOError (vm:5431)

Logs attached.
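Whether a run like this pauses comes down to a race: once the watermark fires, the guest keeps writing into the remaining headroom while the mailbox round-trip, SPM lvextend and HSM refresh complete. A back-of-the-envelope sketch, using the ~70 MiB/s write rate and the 2.6-9.8 s extend durations reported in this thread (numbers illustrative, helper name mine):

```python
MIB = 1024 * 1024

def seconds_until_enospc(free_bytes, write_rate_bytes_per_s):
    """Time the guest can keep writing before it hits the end of the
    current thin volume and qemu raises ENOSPC."""
    return free_bytes / write_rate_bytes_per_s

# With the default watermark the extend is requested when ~512 MiB of
# headroom remains.  At ~70 MiB/s that leaves about 7.3 s for the whole
# extend to land; extends measured in this thread took 2.6-9.8 s, so the
# slowest of them loses the race and the VM pauses.  A faster guest
# writer (the 7.4 vs 7.3 question) shrinks the budget proportionally.
budget = seconds_until_enospc(512 * MIB, 70 * MIB)
print(round(budget, 1))  # 7.3
```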
Test_without_extend_logs_4.2.0-0.0.master.2017072314102
Added logs for the suggested scenario without the extend.
Engine:
4.2.0-0.0.master.20170723141021.git463826a.el7.centos
VDSM:
4.20.1-269
(In reply to Nir Soffer from comment #46)
> Avihay, please test again with master, using this patch:
> https://gerrit.ovirt.org/79015
>
> Please set "virt" logger log level to DEBUG:
>
> vdsm-client Host setLogLevel name=virt level=DEBUG
>
> You will have a log with extension info every 2 seconds for each drive:
>
> 2017-07-30 22:49:53,664+0300 DEBUG (periodic/0) [virt.vm]
> (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extension info for drive sdb
> volume 5aa55005-2922-41ed-8f2a-3072d0d86f02 (capacity=2147483648,
> allocated=641925120, physical=1073741824) (vm:1073)
>
> When extending a volume completes, we will have this log:
>
> 2017-07-30 22:49:55,609+0300 INFO (mailbox-hsm/2) [virt.vm]
> (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extend volume
> 5aa55005-2922-41ed-8f2a-3072d0d86f02 completed in 3.95 seconds (extend=2.12
> seconds, refresh=1.83 seconds) (vm:1187)
>
> It would be helpful to run only this test in automation at least 10 times
> with 7.3 and 7.4.

Nir, I ran the same scenario as in the original issue on this patch, twice, on 7.4 with virt DEBUG enabled, and the issue reproduced on both runs.

See logs of the 7.4 run with DEBUG attached.

I do not have a 7.3 host setup at the moment - is this critical?
engine & vdsm logs no patch 79015 with virt DEBUG enabled
(In reply to Avihai from comment #50)
> Nir, I ran the same scenario as in the original issue on this patch, twice, on 7.4
> with virt DEBUG enabled, and the issue reproduced on both runs.

Thanks, will look at the logs.

> I do not have a 7.3 host setup at the moment - is this critical?

It is, to tell if we have a regression in 7.4.

(In reply to Avihai from comment #51)
> Created attachment 1307036 [details]
> engine & vdsm logs no patch 79015 with virt DEBUG enabled

Avihay, I need the time when io was started - the best way is to include this in the vdsm logs like I did in comment 45, or provide the time separately.

Avihay, you are using a broken version of libvirt in your reproduction:

$ grep 'Extension info for drive vdb' vdsm.log | grep -v allocated=0

No output

$ grep 'Extension info for drive vdb' vdsm.log | head
2017-07-31 14:29:40,768+0300 DEBUG (periodic/3) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:42,758+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:44,755+0300 DEBUG (periodic/0) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:46,762+0300 DEBUG (periodic/2) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:48,756+0300 DEBUG (periodic/3) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:51,242+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:52,750+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:54,748+0300 DEBUG (periodic/3) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:56,823+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:58,782+0300 DEBUG (periodic/0) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)

If libvirt does not tell us about the allocation, we cannot extend.
logs with new libvirt
Added logs for the passed test.
Time dd started 2017-07-31 18:11:28
Avihay, looking at the logs from attachment 1307188 [details], everything looks OK. This also means
that the previous tests are invalid: we used the wrong libvirt version.
1. We detect that a drive should be extended
2017-07-31 18:11:41,316+0300 DEBUG (periodic/2) [virt.vm] (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Requesting an extension for the volume: {'newSize': 2147483648, 'domainID': '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID': '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305203.34, 'internal': False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID': '8e7a842c-993b-47a6-978c-c3a3de652a42'} (vm:1224)
2. We get extend completed message from spm
2017-07-31 18:11:41,395+0300 INFO (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 5fd68ec2-dd7b-4536-88f6-0d04dac32a9f (cmd=<function runTask at 0x2e9d1b8>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x3fe6410>>, {'newSize': 2147483648, 'domainID': '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID': '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305201.31, 'internal': False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID': '8e7a842c-993b-47a6-978c-c3a3de652a42'})) (threadPool:208)
3. Refreshing volume
2017-07-31 18:11:41,398+0300 INFO (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID='84b5768a-ca74-4ce7-ada3-7961f43f69f1', spUUID='8a8b6a4a-3123-4785-ae56-ccc4339493af', imgUUID='8e7a842c-993b-47a6-978c-c3a3de652a42', volUUID='920aaefd-1aa1-40d2-bdaa-08c299657acc') from=internal, task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:46)
2017-07-31 18:11:41,623+0300 INFO (mailbox-hsm/2) [storage.LVM] Refreshing LVs (vg=84b5768a-ca74-4ce7-ada3-7961f43f69f1, lvs=('920aaefd-1aa1-40d2-bdaa-08c299657acc',)) (lvm:1322)
2017-07-31 18:11:41,904+0300 INFO (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:52)
4. Extend completed
2017-07-31 18:11:41,905+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Extend volume 920aaefd-1aa1-40d2-bdaa-08c299657acc completed in 2.61 seconds (extend=2.11 seconds, refresh=0.50 seconds) (vm:1187)
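The four steps above can be sketched as a simplified timing flow. The function names below are illustrative stand-ins, not vdsm's actual API; they mimic how the "completed in X seconds (extend=..., refresh=...)" log line is derived:

```python
import time

def extend_drive(drive, new_size, extend_fn, refresh_fn):
    """Illustrative sketch of the HSM-side extend flow: request an
    extension and wait for the SPM's reply (steps 1-2), refresh the LV
    on the host (step 3), then report timings as in the vdsm log line
    above (step 4).  extend_fn and refresh_fn are hypothetical
    callables standing in for the mailbox round-trip and LVM refresh."""
    started = time.monotonic()
    extend_fn(drive, new_size)   # steps 1-2: request + SPM completion
    extended = time.monotonic()
    refresh_fn(drive)            # step 3: refreshVolume on the host
    done = time.monotonic()
    return {
        "total": done - started,
        "extend": extended - started,
        "refresh": done - extended,
    }
```

The reported total is simply the sum of the extend and refresh phases, which is why the log line breaks it down that way.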
Can you run this test at least 10 times on the same setup, so we can compare
failure rates with 7.3?
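The repeated runs could be driven by a small script; this is a rough sketch mirroring the dd command quoted earlier in this bug, with a placeholder target path rather than the QA job's real mount point:

```python
import os
import subprocess

def run_dd_round(target, megabytes=1358):
    """One round of the dd workload from this bug (the target path is
    a placeholder).  Returns True if dd exited cleanly, False on
    failure - e.g. ENOSPC when the thin disk was not extended in time."""
    cmd = ["dd", "bs=1M", "count=%d" % megabytes,
           "if=/dev/zero", "of=%s" % target, "status=none"]
    ok = subprocess.call(cmd) == 0
    if os.path.exists(target):
        os.remove(target)  # start each round from an empty file
    return ok

def repeat_test(target, rounds=10, megabytes=1358):
    """Run the workload `rounds` times; return the failing round numbers."""
    return [i for i in range(1, rounds + 1)
            if not run_dd_round(target, megabytes)]
```

An empty failure list over 10 rounds would match a clean comparison baseline; any VM pause should surface as a failed dd round.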
Since we don't have any real failure yet with the current libvirt, I think we should move this to 4.1.5.

(In reply to Nir Soffer from comment #59)
> Avihay, looking at logs from attachment 1307188 [details], everything looks ok.
> [...]
> Can you run this test at least 10 times on the same setup, so we can compare
> failure rates with 7.3?

1) The current setup is 4.2 + the new libvirt + your patch (https://gerrit.ovirt.org/79015) -> do you want the 10 runs on this setup?
2) Do you have the 7.3 rates for comparison, or are 10 additional runs needed there as well (take into consideration that we currently do not have a RHEL 7.3 test stand)?

(In reply to Nir Soffer from comment #59)
> [...]
> Can you run this test at least 10 times on the same setup, so we can compare
> failure rates with 7.3?

I ran the test 10 times on the same setup (RHEL 7.4, new libvirt, your patch). Logs are attached.

Created attachment 1307513 [details]
10 runs logs on RHEL7_4 with new libvirt & patch
Avihay, did you reproduce the issue? The previous build seems to be fine.

Looking in the logs from attachment 1307513 [details]:

$ grep "Extend volume" vdsm.log
2017-07-31 18:11:41,905+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Extend volume 920aaefd-1aa1-40d2-bdaa-08c299657acc completed in 2.61 seconds (extend=2.11 seconds, refresh=0.50 seconds) (vm:1187)
2017-07-31 18:19:02,305+0300 INFO (mailbox-hsm/4) [virt.vm] (vmId='f9e0b544-4c50-45a8-9ef1-f899e927eba7') Extend volume e7c20240-9045-4d13-b74a-7872daf18f5a completed in 2.72 seconds (extend=2.14 seconds, refresh=0.58 seconds) (vm:1187)
2017-07-31 18:30:00,511+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='93dfee3f-36fd-4f5e-b619-85a8b8a76171') Extend volume 55931609-dadb-4a9e-a8b4-d0acddaff678 completed in 2.66 seconds (extend=2.09 seconds, refresh=0.57 seconds) (vm:1187)
2017-08-01 11:51:06,012+0300 INFO (mailbox-hsm/1) [virt.vm] (vmId='a242043a-96ce-4224-a5f4-8ec486aa6bfb') Extend volume c26196cf-420d-47b1-afbc-1eed99128e09 completed in 2.49 seconds (extend=2.10 seconds, refresh=0.39 seconds) (vm:1187)
2017-08-01 11:55:28,238+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='e06d0f3c-fb61-48b6-afe3-cd050c37694f') Extend volume 0b17fd30-33b1-46c9-875d-a80e4744cfc5 completed in 4.73 seconds (extend=4.12 seconds, refresh=0.61 seconds) (vm:1187)
2017-08-01 12:13:34,515+0300 INFO (mailbox-hsm/2) [virt.vm] (vmId='8add541b-020a-471f-9e80-26ae03453bf5') Extend volume 373cb281-6d28-4858-9e9e-6df9240be0ce completed in 2.64 seconds (extend=2.10 seconds, refresh=0.54 seconds) (vm:1187)
2017-08-01 12:36:20,938+0300 INFO (mailbox-hsm/1) [virt.vm] (vmId='195b8335-d352-4063-99b9-bf87fd765e63') Extend volume e4631228-7ed9-449b-b56a-ecb6fd34dbe5 completed in 2.58 seconds (extend=2.08 seconds, refresh=0.50 seconds) (vm:1187)
2017-08-01 12:45:27,140+0300 INFO (mailbox-hsm/1) [virt.vm] (vmId='ba4d8d03-46b1-4335-96d7-69aa6971fc9e') Extend volume 2f6ec7e1-15f2-43fe-9c78-3617caff74cc completed in 2.58 seconds (extend=2.13 seconds, refresh=0.45 seconds) (vm:1187)

We had 8 extensions - maybe the test ran only 8 times instead of 10, or maybe the vdsm log is partial?

All extends look similar; one was longer. This is expected, since we check for messages every 2 seconds - if the SPM sent a message just after we checked, we see it 2 seconds later. Refresh times look short; I don't see the random, long refresh times seen on my host.

$ egrep "vm stop device .+ error enospc" vdsm.log
(no output)

Avihay, I don't see any pause; I think we can safely close this bug. The pauses you reported were a result of broken libvirt, as explained in comment 32.

(In reply to Nir Soffer from comment #68)
> Avihay, did you reproduce the issue? the previous build seems to be fine.

No, the issue did not reproduce in any run (over 10 runs) using the new libvirt.

Verified.
Engine: ovirt-engine-4.1.4-0.2.el7.noarch
Host: VDSM 4.19.25-1
New libvirt: libvirt-3.2.0-14.el7_4.2.x86_64

Changing target per Tal's request.

Reverting milestone per Allon's request, changing target instead.

I backported the logging improvements added when debugging this issue, so we have better logging in 4.1.
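As a footnote, the grep-based check of "Extend volume ... completed" lines used in this thread can be summarized mechanically. The regex below is inferred from the log lines quoted in this bug; the exact format may differ across vdsm versions:

```python
import re

# Pattern inferred from vdsm's extend-completed log line, e.g.
# "Extend volume <uuid> completed in 2.61 seconds
#  (extend=2.11 seconds, refresh=0.50 seconds)"
EXTEND_RE = re.compile(
    r"Extend volume \S+ completed in (?P<total>[\d.]+) seconds "
    r"\(extend=(?P<extend>[\d.]+) seconds, "
    r"refresh=(?P<refresh>[\d.]+) seconds\)"
)

def summarize_extends(log_lines):
    """Return (count, max_total, max_refresh) over all matching lines."""
    totals, refreshes = [], []
    for line in log_lines:
        m = EXTEND_RE.search(line)
        if m:
            totals.append(float(m.group("total")))
            refreshes.append(float(m.group("refresh")))
    if not totals:
        return (0, 0.0, 0.0)
    return (len(totals), max(totals), max(refreshes))
```

Run over the 10-run vdsm.log above, this would report the 8 extensions, the 4.73-second outlier, and confirm that no refresh exceeded a second.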