Bug 1461536

Summary: [TestOnly] RHEL 7.4 | VM pause running IO after extending block (ISCSI/FCP) sparse qcow disk
Product: [oVirt] vdsm
Reporter: Avihai <aefrat>
Component: Core
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Avihai <aefrat>
Severity: high
Docs Contact:
Priority: high
Version: ---
CC: aefrat, amureini, bugs, coli, dyuan, eberman, jiyan, juzhang, lmen, nsoffer, pkrempa, ratamir, xuzhang, ykaul, ylavi
Target Milestone: ovirt-4.1.4-1
Keywords: Regression, Reopened, TestOnly
Target Release: 4.19.25
Flags: rule-engine: ovirt-4.1+
       rule-engine: blocker+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-08-23 08:02:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1470127, 1473145
Bug Blocks:
Attachments:
  engine & vdsm logs (flags: none)
  log in /var/log/libvirt/libvirtd.log from lmen (flags: none)
  error info in the guest from lmen (flags: none)
  new_logs (flags: none)
  Test_without_extend_logs_4.2.0-0.0.master.2017072314102 (flags: none)
  engine & vdsm logs no patch 79015 with virt DEBUG enabled (flags: none)
  logs with new libvirt (flags: none)
  10 runs logs on RHEL7_4 with new libvirt & patch (flags: none)

Description Avihai 2017-06-14 17:38:39 UTC
Created attachment 1287728 [details]
engine & vdsm logs

Description of problem:
The VM pauses while running IO after extending a block (ISCSI/FCP) sparse qcow disk on rhel7.4.

Version-Release number of selected component (if applicable):
Engine :
ovirt-engine-4.1.3.2-0.1.el7.noarch

VDSM:
vdsm-4.19.18-1.el7ev.x86_64

All hosts & the Engine run rhel7.4:
Red Hat Enterprise Linux Server 7.4 Beta (Maipo)

How reproducible:
Occurred once on both FCP & ISCSI in the automation Tier1 run on RHEL 7.4:
https://rhev-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/rhv-4.1v3-ge-runner-tier1/30/

It did not occur when I ran the same test on the same build on another setup with RHEL 7.3.

Steps to Reproduce:
1. Clone a VM from a template with an OS disk on an iscsi/FC storage domain.
2. Create an additional thin disk on iscsi/FC with:
size = 1G
format = QCOW
sparse = True

3. Extend the created disk by an additional 1G (new total size 2G).
4. Create a partition + FS (ext4) on the new disk.
5. Write IO via dd to the newly created FS (dd size = disk size - 600M ~= 1358M); see the sketch after the example below. For example, in the local run I saw:

2017-06-14 16:23:51,545 - MainThread - rhevmtests.storage.helpers - INFO - Performing command '/bin/dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'
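
A minimal sketch of steps 4-5 inside the guest (an illustration only; it assumes the new disk shows up as /dev/vdb and uses an arbitrary mount point, while the automation generates its own paths):

# create a partition and an ext4 FS on the new disk, mount it, and fill it with dd
parted -s /dev/vdb mklabel msdos mkpart primary ext4 1MiB 100%
mkfs.ext4 /dev/vdb1
mkdir -p /mnt/test && mount /dev/vdb1 /mnt/test
dd bs=1M count=1358 if=/dev/vda of=/mnt/test/written_test_storage status=none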


Actual results:
During write IO (dd) to the disk, the VM pauses.


Expected results:
The VM should not pause during dd.

Additional info:
Timeline of the same test failing on FCP & ISCSI:
11:45:19 2017-06-11 11:45:19,048 INFO   STORAGE: FCP
11:45:19 2017-06-11 11:45:19,048 INFO   POLARION: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-5063
11:46:44 2017-06-11 11:46:44,124 INFO    Test step      1: Resizing disk disk_5063
11:53:13 2017-06-11 11:53:13,391 INFO    Test step      2: Performing 'dd' command to extended disk disk_5063
11:53:29 2017-06-11 11:53:29,354 ERROR Result: FAILED
11:53:29 2017-06-11 11:53:29,354 INFO ================================================================================
11:53:29 F2017-06-11 11:53:29,360 INFO 092: storage/rhevmtests.storage.storage_virtual_disk_resize.test_virtual_disk_resize.TestCase5063_REST_ISCSI.test_thin_block_resize
11:53:29 2017-06-11 11:53:29,360 INFO     - VM with thin disk and OS
11:53:29 2017-06-11 11:53:29,360 INFO     - Resize the VM disk to 2G total
11:53:29 2017-06-11 11:53:29,360 INFO     - Send IOs to disk
11:53:29 2017-06-11 11:53:29,360 INFO     - Check LV size on VDSM and disk size on guest
11:53:29 2017-06-11 11:53:29,360 INFO   API: REST
11:53:29 2017-06-11 11:53:29,360 INFO   STORAGE: ISCSI
11:53:29 2017-06-11 11:53:29,360 INFO   POLARION: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-5063
11:55:05 2017-06-11 11:55:05,529 INFO    Test step      1: Resizing disk disk_5063
12:01:38 2017-06-11 12:01:38,163 INFO    Test step      2: Performing 'dd' command to extended disk disk_5063
12:01:55 2017-06-11 12:01:55,427 ERROR Result: FAILED


Engine log:
The resize was done on host_mixed_2:

2017-06-11 11:46:52,878+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVolumeInfoVDSCommand] (org.ovirt.thread.pool-6-thread-13) [] START, GetVolumeInfoVDSCommand(HostName = host_mixed_2, GetVolumeInfoVDSCommandParameters:{runAsyn
c='true', hostId='405b7998-e00b-4914-8619-0df4e32f20b9', storagePoolId='d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd', storageDomainId='df501232-7b90-4c2a-b0a6-6ed4570f3448', imageGroupId='2a555c74-e36b-4345-9afd-65b181c80001', imageId='277075e4-
d02c-4f9c-8347-42c62515b012'}), log id: 603b2825

2017-06-11 11:46:54,008+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-13) [] EVENT_ID: USER_EXTEND_DISK_SIZE_SUCCESS(371), Correlation ID: null, Call Stack: null, Custom Ev
ent ID: -1, Message: Size of the disk 'disk_5063' was successfully updated to 2 GB by admin@internal-authz.

2017-06-11 11:46:54,035+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-13) [] EVENT_ID: USER_UPDATE_VM_DISK(88), Correlation ID: disks_update_63c5069c-a77f-4656, Call Stack:
 null, Custom Event ID: -1, Message: VM virtual_disk_resize_fcp_5063 disk_5063 disk was updated by admin@internal-authz.

2017-06-11 11:47:21,786+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-13) [] VM 'ca4c86cb-dba3-4feb-8d83-5f864b24c487'(virtual_disk_resize_fcp_5063) moved from 'Up' --> 'Paused'
2017-06-11 11:47:21,855+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED(1,025), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vir
tual_disk_resize_fcp_5063 has been paused.
2017-06-11 11:47:21,883+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-13) [] EVENT_ID: VM_PAUSED_ENOSPC(138), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM virtual_disk_resize_fcp_5063 has been paused due to no Storage space error.

2017-06-11 11:56:16,597+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [7327ead7] EVENT_ID: VM_PAUSED_ENOSPC(138), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM virtual_disk_resize_iscsi_5063 has been paused due to no Storage space error.

2017-06-11 12:01:40,189+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-29) [vms_syncAction_83f9b981-3de2-491d] EVENT_ID: USER_STOP_VM(33), Correlation ID: vms_syncAction_83f9b981-3de2-491d, Job ID: a9017f47-5a09-4c39-97fe-7c8b73353e34, Call Stack: null, Custom Event ID: -1, Message: VM virtual_disk_resize_iscsi_5063 powered off by admin@internal-authz (Host: host_mixed_2).
2017-06-11 12:01:47,670+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmsStatisticsFetcher] (DefaultQuartzScheduler4) [7f26dbf3] Fetched 0 VMs from VDS '405b7998-e00b-4914-8619-0df4e32f20b9'
2017-06-11 12:01:47,670+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler4) [7f26dbf3] VM 'bfe2d827-42ea-4b5e-ac11-e033e30d99ee'(virtual_disk_resize_iscsi_5063) is running in db and not running on VDS '405b7998-e00b-4914-8619-0df4e32f20b9'(host_mixed_2)
2017-06-11 12:01:47,681+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler4) [7f26dbf3] EVENT_ID: VM_DOWN(61), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM virtual_disk_resize_iscsi_5063 is down.  

VDSM (host_mixed_2):
Extend occurs:
2017-06-11 11:46:44,460+0300 INFO  (jsonrpc/6) [vdsm.api] START extendVolumeSize(spUUID='d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd', sdUUID='df501232-7b90-4c2a-b0a6-6ed4570f3448', imgUUID='2a555c74-e36b-4345-9afd-65b181c80001', volUUID='277075e4-d02c-4f9c-8347-42c62515b012', newSize='2147483648') from=::ffff:10.35.69.94,45266, flow_id=disks_update_63c5069c-a77f-4656 (api:46)
2017-06-11 11:46:44,500+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='4da2373b-dd7c-4391-b007-ab58a16b46d9') Task.run: running job 0: extendVolumeSize: <bound method StoragePool.extendVolumeSize of <storage.sp.StoragePool object 
at 0x30eb210>> (args: ('df501232-7b90-4c2a-b0a6-6ed4570f3448', '2a555c74-e36b-4345-9afd-65b181c80001', '277075e4-d02c-4f9c-8347-42c62515b012', 4194304) kwargs: {}) (task:909)
2017-06-11 11:46:44,491+0300 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Volume.extendSize succeeded in 0.03 seconds (__init__:539)

2017-06-11 11:46:44,500+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='4da2373b-dd7c-4391-b007-ab58a16b46d9') Job.run: running extendVolumeSize: <bound method StoragePool.extendVolumeSize of <storage.sp.StoragePool object at 0x30e
b210>> (args: ('df501232-7b90-4c2a-b0a6-6ed4570f3448', '2a555c74-e36b-4345-9afd-65b181c80001', '277075e4-d02c-4f9c-8347-42c62515b012', 4194304) kwargs: {}) callback None (task:330)

VDSM pause occurred:
2017-06-11 11:47:19,453+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00361567 s, 283 MB/s\n'; <rc> = 0 (commands:93)
2017-06-11 11:47:21,230+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') abnormal vm stop device virtio-disk1 error enospc (vm:4166)
2017-06-11 11:47:21,231+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onIOError (vm:5045)
2017-06-11 11:47:21,468+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-11 dd if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None
) (commands:69)
2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)
2017-06-11 11:47:21,508+0300 WARN  (periodic/1) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on ['ca4c86cb-dba3-4feb-8d83-5f864b24c487'] (periodic:301)
2017-06-11 11:47:21,776+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') No VM drives were extended (vm:4173)
2017-06-11 11:47:21,777+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onSuspend (vm:5045)

Comment 1 Yaniv Kaul 2017-06-15 06:44:12 UTC
Eyal, please work with Avihai and let's ensure it is the same issue we are familiar with.

Comment 2 Allon Mureinik 2017-06-15 12:14:35 UTC
(In reply to Yaniv Kaul from comment #1)
> Eyal, please work with Aviahi and let's ensure it is the same issue we are
> familiar with.
Link?

Comment 3 Avihai 2017-06-15 12:18:12 UTC
Guy Chen is looking at this issue & will respond shortly.

More info on my side:
Guy ran fio writing 10 files of 1 GB each; I ran a single dd with a total size of 1328MB and bs=1M.


https://bugzilla.redhat.com/show_bug.cgi?id=1461536
By the VDSM log timeline:
2017-06-11 11:46:44 Extend succeeded
2017-06-11 11:46:45 dd started
2017-06-11 11:47:21 VM paused

dd ran for ~45 sec which, if the extend was not done, filled 1GB & paused the VM.

Example of the dd done:
> dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'
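
A rough back-of-the-envelope from the timeline above (an illustration, assuming the LV stayed at its initial 1 GiB because the thin-provisioning extend never kicked in):

# dd started ~11:46:45, the pause hit at 11:47:21 -> ~36 s to fill the 1 GiB LV
echo $((1024 / 36))   # ~28 MB/s effective write rate before ENOSPC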

Comment 4 Yaniv Kaul 2017-06-15 12:20:57 UTC
(In reply to Allon Mureinik from comment #2)
> (In reply to Yaniv Kaul from comment #1)
> > Eyal, please work with Aviahi and let's ensure it is the same issue we are
> > familiar with.
> Link?

https://bugzilla.redhat.com/show_bug.cgi?id=1408594

Comment 5 Yaniv Kaul 2017-06-15 12:24:56 UTC
(In reply to Avihai from comment #3)
> Guy Chen is looking at this issue & will respond shortly.
> 
> More info on my side:
> Guy ran fio that writes 10 files of 1 GB each, I ran with a one dd total
> size of 1328MB with bs=1MB
> 
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1461536
> By VSDM log timeline :
> 2017-06-11 11:46:44 Extend suceeded

Is this extend relevant?

> 2017-06-11 11:46:45 dd started 
> 2017-06-11 11:47:21 VM pause
> 
> Time dd ran = ~45 sec  , which if extend was not done, filled 1GB & paused
> the VM.

45s for 1GB is not very fast. However, what is interesting is when we fill >50%, and how long the whole extend process takes. You can find that out via the VDSM logs - the whole flow, through the mailbox, etc. 45secs is quite a lot actually.

> 
> Example of the dd done:
> > dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'

This is not a very good 'dd' command. For example, it uses caching.
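
For comparison, a variant of the same write that bypasses the page cache (a sketch only, based on the command above):

/bin/dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage oflag=direct status=none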

Comment 6 Avihai 2017-06-15 14:22:15 UTC
(In reply to Yaniv Kaul from comment #5)
> (In reply to Avihai from comment #3)
> > Guy Chen is looking at this issue & will respond shortly.
> > 
> > More info on my side:
> > Guy ran fio that writes 10 files of 1 GB each, I ran with a one dd total
> > size of 1328MB with bs=1MB
> > 
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1461536
> > By VSDM log timeline :
> > 2017-06-11 11:46:44 Extend suceeded
> 
> Is this extend relevant?
Yes, this is the extend I initiated on the disk from 1G -> 2G;
why is this not relevant?

You can see in the VDSM log in the original description that it finished successfully at 11:46:44.

> > 2017-06-11 11:46:45 dd started 
> > 2017-06-11 11:47:21 VM pause
> > 
> > Time dd ran = ~45 sec  , which if extend was not done, filled 1GB & paused
> > the VM.
> 
> 45s for 1GB is not very fast. However, what is interesting is when do we
> fill >50%, how long it takes for the whole extending process. You can find
> out via the VDSM logs - the whole flow, through the mailbox, etc. 45secs is
> quite a log actually.

Right before the VM pause I see the following mailbox-spm messages in the VDSM log, but these messages appear all over the log even before the dd started - how do you know from them when the extend was done?

From VDSM log :
2017-06-11 11:47:19,434+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-11 dd if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
2017-06-11 11:47:19,453+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00361567 s, 283 MB/s\n'; <rc> = 0 (commands:93)
2017-06-11 11:47:21,230+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') abnormal vm stop device virtio-disk1 error enospc (vm:4166)
2017-06-11 11:47:21,231+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onIOError (vm:5045)
2017-06-11 11:47:21,468+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-11 dd if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)
2017-06-11 11:47:21,508+0300 WARN  (periodic/1) [virt.periodic.VmDispatcher] could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on ['ca4c86cb-dba3-4feb-8d83-5f864b24c487'] (periodic:301)
2017-06-11 11:47:21,776+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') No VM drives were extended (vm:4173)
2017-06-11 11:47:21,777+0300 INFO  (libvirt/events) [virt.vm] (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onSuspend (vm:5045)


> > Example of the dd done:
> > > dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'
> 
> This is not a very good 'dd' command. For example, it uses caching.
I know guest caching is not disabled, since "oflag=direct" is not used in the dd here, but this is a functional test & we are not testing performance, so in the worst case caching just makes the write IO faster here - and we should be able to handle that.

Comment 7 Yaniv Kaul 2017-06-15 14:28:48 UTC
(In reply to Avihai from comment #6)
> (In reply to Yaniv Kaul from comment #5)
> > (In reply to Avihai from comment #3)
> > > Guy Chen is looking at this issue & will respond shortly.
> > > 
> > > More info on my side:
> > > Guy ran fio that writes 10 files of 1 GB each, I ran with a one dd total
> > > size of 1328MB with bs=1MB
> > > 
> > > 
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1461536
> > > By VSDM log timeline :
> > > 2017-06-11 11:46:44 Extend suceeded
> > 
> > Is this extend relevant?
> Yes, this is my initiated extent done on the disk from 1G->2G ,
> why is this not relevant?

Because we'd like to understand how long the next extend took. Although you could compare it to this extend - how much time did it take?

> 
> You can see bugs in original description in VDSM log that it finished
> successfully at 11:46:44.
> 
> > > 2017-06-11 11:46:45 dd started 
> > > 2017-06-11 11:47:21 VM pause
> > > 
> > > Time dd ran = ~45 sec  , which if extend was not done, filled 1GB & paused
> > > the VM.
> > 
> > 45s for 1GB is not very fast. However, what is interesting is when do we
> > fill >50%, how long it takes for the whole extending process. You can find
> > out via the VDSM logs - the whole flow, through the mailbox, etc. 45secs is
> > quite a log actually.
> 
> Right before VM pause I see the following mailbox-spm messages in VDSM log,
> but these messages are all over the log before the dd started, how do you
> know from them when extend was done? 

Check the SPM logs - it's the one doing the extend.
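
For example, on the SPM host one could grep the extend flow out of the vdsm log (a sketch; the patterns are message strings that appear in the logs quoted in this bug and may differ between vdsm versions):

grep -E 'extendVolumeSize|sendExtendMsg|MailMonitor|mailbox' /var/log/vdsm/vdsm.log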

> 
> From VDSM log :
> 2017-06-11 11:47:19,434+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> /usr/bin/taskset --cpu-list 0-11 dd
> if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/
> inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
> 2017-06-11 11:47:19,453+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB)
> copied, 0.00361567 s, 283 MB/s\n'; <rc> = 0 (commands:93)
> 2017-06-11 11:47:21,230+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') abnormal vm stop device
> virtio-disk1 error enospc (vm:4166)

This is already too late, I reckon.

> 2017-06-11 11:47:21,231+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onIOError
> (vm:5045)
> 2017-06-11 11:47:21,468+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> /usr/bin/taskset --cpu-list 0-11 dd
> if=/rhev/data-center/d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd/mastersd/dom_md/
> inbox iflag=direct,fullblock count=1 bs=1024000 (cwd None) (commands:69)
> 2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd]
> SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB)
> copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)
> 2017-06-11 11:47:21,508+0300 WARN  (periodic/1) [virt.periodic.VmDispatcher]
> could not run <class 'vdsm.virt.periodic.DriveWatermarkMonitor'> on
> ['ca4c86cb-dba3-4feb-8d83-5f864b24c487'] (periodic:301)
> 2017-06-11 11:47:21,776+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') No VM drives were extended
> (vm:4173)

Not sure what this is...

> 2017-06-11 11:47:21,777+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='ca4c86cb-dba3-4feb-8d83-5f864b24c487') CPU stopped: onSuspend
> (vm:5045)
> 
> 
> > > Example of the dd done:
> > > > dd bs=1M count=1358 if=/dev/vda of=/mount-point8dd63c4ee37471c9e7e210f415ad88c1d6a79040/written_test_storage status=none'
> > 
> > This is not a very good 'dd' command. For example, it uses caching.
> I know client caching is not disabled as "-oflag=direct" is not used in the
> dd here but this is a functional test & we are not testing performance so in
> the worst case caching should make a faster write IO/s here but we should be
> able to handle it.


I believe caching is (sometimes) bad as it does a burst of IO - it will dump all pending IO at once when sync'ing. If you do not use caching, it hopefully will not burst. We have less chance of dealing with a burst of IO, that's all.

Comment 10 lijuan men 2017-06-23 10:24:02 UTC
I am a libvirt QE.

I tried the simplest scenario without RHV or iscsi, using only libvirt and an LV, and can reproduce this issue.

Maybe this issue is related to the resize operation on an *lv* combined with error_policy='stop' (in RHV the disk also uses this value):

the resize operation causes the IO error, and error_policy='stop' causes the guest to pause.

version:
libvirt-3.2.0-15.el7a.x86_64
qemu-kvm-rhev-2.9.0-14.el7.x86_64

steps:
1. prepare a 1G qcow2 lv

[root@lmen1 ~]#  lvcreate -L 1G -n lvv vgg
  Logical volume "lvv" created.

[root@lmen1 ~]# qemu-img convert -O qcow2 /dev/vgg/lvv /dev/vgg/lvv


2. start a guest with:
 <disk type='block' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
      <source dev='/dev/vgg/lvv'/>
      <target dev='sdb' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>

[root@lmen1 ~]# virsh start test
Domain test started

In the guest, there is the 1G disk sdb.

3. resize the disk

[root@lmen1 ~]#  virsh blockresize test --path /dev/vgg/lvv --size 2G
Block device '/dev/vgg/lvv' is resized

In the guest, the size of sdb is 2G.

4. run a dd command in the guest

In the guest, run "dd if=/dev/zero of=/dev/sdb bs=1M count=1358"

5. check the guest status on the host

[root@lmen1 ~]# virsh list
 Id    Name                           State
----------------------------------------------------
 33    test                           paused

The log info is uploaded as attachment 1 [details].
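
To confirm why libvirt paused the guest, one could also check directly (a sketch; 'test' is the domain name used above):

virsh domstate test --reason    # expected to report something like: paused (I/O error)
virsh domblkerror test          # lists the block device(s) that hit the error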

PS:
I have tried other scenarios:

1. Change error_policy to 'report' in step 2 and repeat all the steps above.
   Result: the guest is still in running status, but some IO errors are output in the guest; the error info is uploaded as the 2nd attachment.
   So I think the paused status is caused by error_policy='stop'.

2. Keep error_policy as 'stop', replace the LV with '/var/lib/libvirt/images/lmen.qcow2', and repeat the steps above.
   Result: dd succeeds, no error, and the guest is still in running status.
   So I think the issue is related to the LV.

3. Keep error_policy as 'stop', create a *2G* qcow2 LV directly, and start the guest with this LV. Run "dd if=/dev/zero of=/dev/sdb bs=1M count=1358".
   Result: dd succeeds, no error, and the guest is still in running status.
   So I think the issue is related to the resize operation on the LV.

4. Repeat all the steps, but in step 4 run "dd if=/dev/zero of=/dev/sdb bs=1M count=900".
   Result: dd succeeds, no error, and the guest is still in running status. But when count >= 1000, the guest gets paused.


I am not sure whether these steps are a valid testing scenario. Could anyone help check their validity?
If they are valid, could the qemu folks help reproduce this issue?

Comment 11 lijuan men 2017-06-23 10:27:08 UTC
Created attachment 1290965 [details]
log in /var/log/libvirt/libvirtd.log from lmen

Comment 12 lijuan men 2017-06-23 10:28:33 UTC
Created attachment 1290966 [details]
error info in the guest from lmen

Comment 13 Xuesong Zhang 2017-06-23 14:03:18 UTC
(In reply to Avihai from comment #0)
......
> 
> Steps to Reproduce:
> 1.Clone VM from template with OS from iscsi/FC storage domain.
> 2.create additional thin disk from iscsi/FC with:
> size = 1G
> format = QCOW
> sparse = True


hi, 

Could RHV devel/QE help let libvirt QE know which libvirt (or other component) API RHV calls while implementing the disk-extend action below? Then we (libvirt QE) can better judge whether the pure-libvirt reproduction steps in comment 10 are right or not.



> 3.Extend the created disk with an additonal 1G ( New total size 2G)
> 4.Create partition + FS (ext4) on new disk
> 5.Write IO via dd to the newly created FS (dd size = Disk size - 600M ~=
> 1358M), for example in the local run I saw:
> 
.......

Comment 14 Gil Klein 2017-06-25 14:11:09 UTC
Avihai, could you please help answer the libvirt qe questions in comment #13 ?

Comment 15 Avihai 2017-06-25 17:31:42 UTC
Hi Xuesong Zhang ,

I'm not sure if this is what you mean, but here goes - storage devs, please DO correct me if I'm wrong:

There are 2 extend types & 2 matching APIs used here:

1) user initiated disk extend (disk size from 1G -> 2G)

API: 
vdsm.api -> function 'extendVolumeSize' :

VDSM example:
2017-06-11 11:46:44,460+0300 INFO  (jsonrpc/6) [vdsm.api] START extendVolumeSize(spUUID='d87bb5e6-b7ad-4f7e-a577-4761c82aa0bd', sdUUID='df501232-7b90-4c2a-b0a6-6ed4570f3448', imgUUID='2a555c74-e36b-4345-9afd-65b181c80001', volUUID='277075e4-d02c-4f9c-8347-42c62515b012', newSize='2147483648') from=::ffff:10.35.69.94,45266, flow_id=disks_update_63c5069c-a77f-4656 (api:46)
2017-06-11 11:46:44,500+0300 DEBUG (tasks/2) [storage.TaskManager.Task] (Task='4da2373b-dd7c-4391-b007-ab58a16b46d9') Task.run: running job 0: extendVolumeSize: <bound method StoragePool.extendVolumeSize of <storage.sp.StoragePool object 
at 0x30eb210>> (args: ('df501232-7b90-4c2a-b0a6-6ed4570f3448', '2a555c74-e36b-4345-9afd-65b181c80001', '277075e4-d02c-4f9c-8347-42c62515b012', 4194304) kwargs: {}) (task:909)

2) Thin-provisioned disks grow (extend) in fixed-size extents, so the storage only consumes as much space as is actually used, growing as the user writes IO into more extents.

As I understood from Yaniv K. (comment 5), the API is:
mailbox-spm -> storage.Misc.excCmd:

Example :
2017-06-11 11:47:21,486+0300 DEBUG (mailbox-spm) [storage.Misc.excCmd] SUCCESS: <err> = '1+0 records in\n1+0 records out\n1024000 bytes (1.0 MB) copied, 0.00332577 s, 308 MB/s\n'; <rc> = 0 (commands:93)

As you can see here, the extent size = 1MB.

Comment 16 lijuan men 2017-06-27 06:06:34 UTC
No matter which method is used to extend the size of the block device (the vdsm API or the virsh blockresize command), the results are the same: 'qemu-img info' lists the right size (2G), and the size in the guest is also right.

So we think the main problem occurs in the subsequent operation - writing IO (dd) to the disk - and we don't need to focus too much on the extend methods.

Maybe it is a qemu issue, so @juzhang, could you please find a suitable qemu person to help reproduce the issue?

Comment 17 juzhang 2017-06-27 06:11:27 UTC
(In reply to lijuan men from comment #16)
> no matter which method to extend the size of the block(using vdsm api or
> virsh blockresize cmd),the results are the same: 'qemu-img info' can list
> the right size(2G),and the size in the guest is also right.
> 
> So we think,the main problem occurs in the following operation: write
> IOs(dd) to the disk,we don't need to focus too much on the extending methods.
> 
> Maybe it is a qemu issue,so @juzhang,could you please find the suitable qemu
> guy to help reproduce the issue?

Hi Coli,

Could you handle this?

Best Regards,
Junyi

Comment 18 Avihai 2017-06-27 08:01:58 UTC
Adding my latest discussion with Yaniv K.: it looks like the write IO rate on rhel7.4 is higher than on rhel7.3, which causes the extend not to be triggered fast/early enough, which triggers this bug.

Discussion highlights:
"
Me:
- I checked again with the current dd in rhel7.3 & the issue does not reproduce.
- Checking on rhel7.4 with the same build & the current dd, the issue reproduces right away.
- Slowing down the dd (with if=/dev/urandom & oflag=direct) did solve the issue, but this is not the solution, just an indication of the problem.

Original dd performed:
'/bin/dd bs=1M count=1358 if=/dev/vda of=/mount-point_2617184396/written_test_storage status=none'

Slow dd performed:
'/bin/dd bs=64k count=1358 if=/dev/urandom of=/mount-point_2618294159/written_test_storage status=none oflag=direct'

It's the exact same host, so no change in the host itself except the upgrade to 7.4, which I did after checking the issue on 7.3.
However, I did not measure IOPS (for some reason kill -USR1 <dd_PID> did not work).

Anyway, it looks like 7.4 IOPS are higher, which causes this issue to appear.

Yaniv K.:
Exactly. And higher IOPS are a good thing. The question is whether we should change our default 1g extend to a larger value.
"

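To put rough numbers on the race between the guest writes and the extend (an illustration only, assuming the default 1 GB extend chunk triggered at ~50% utilization and the ~28 MB/s write rate estimated from the timeline in comment 3):

# extend request fires at ~50% of the 1 GB chunk
echo $((1024 * 50 / 100))      # ~512 MB allocated when the request goes out
# remaining headroom divided by the write rate = time budget for the whole extend round-trip
echo $(((1024 - 512) / 28))    # ~18 seconds before ENOSPC
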
Comment 19 CongLi 2017-06-27 09:08:38 UTC
(In reply to Avihai from comment #15)
> 2) As in storages, thin provisioned disk is growing(extending) in
> extents(fixed size) so storage only uses the used amount of disk size as
> user use(write IO) more extents.

Hi Avihai,

Could you please provide the information about the thin-provisioned disk?

cat /sys/block/sd*/queue/discard_max_bytes
cat /sys/devices/***/provisioning_mode

Thanks.

Comment 20 Avihai 2017-06-27 19:40:39 UTC
(In reply to CongLi from comment #19)
> (In reply to Avihai from comment #15)
> > 2) As in storages, thin provisioned disk is growing(extending) in
> > extents(fixed size) so storage only uses the used amount of disk size as
> > user use(write IO) more extents.
> 
> Hi Avihai,
> 
> Could you please provides the information about the thin provisioned disk ?
> 
> cat /sys/block/sd*/queue/discard_max_bytes
Hi Cong, 

Do you mean on the VM itself or on the host ?

If you mean on the VM, then unfortunately the VM is paused so I cannot connect to it & extract this info.

If you meant on the host (which the VM resides on), then this is what I get:

cat /sys/block/sd*/queue/discard_max_bytes
1073741824
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608
8388608


> cat /sys/devices/***/provisioning_mode

This did not work:
[root@storage-ge1-vdsm3 log]# cat /sys/devices/***/provisioning_mode
cat: /sys/devices/***/provisioning_mode: No such file or directory

But with the find utility, I located the provisioning_mode files:
[root@storage-ge1-vdsm3 log]# find  /sys/devices/ -name provisioning_mode
/sys/devices/pci0000:00/0000:00:04.0/virtio1/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:1/scsi_disk/3:0:0:1/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:2/scsi_disk/3:0:0:2/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:3/scsi_disk/3:0:0:3/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:4/scsi_disk/3:0:0:4/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:5/scsi_disk/3:0:0:5/provisioning_mode
/sys/devices/platform/host3/session1/target3:0:0/3:0:0:6/scsi_disk/3:0:0:6/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:1/scsi_disk/4:0:0:1/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:2/scsi_disk/4:0:0:2/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:3/scsi_disk/4:0:0:3/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:4/scsi_disk/4:0:0:4/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:5/scsi_disk/4:0:0:5/provisioning_mode
/sys/devices/platform/host4/session2/target4:0:0/4:0:0:6/scsi_disk/4:0:0:6/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:1/scsi_disk/5:0:0:1/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:2/scsi_disk/5:0:0:2/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:3/scsi_disk/5:0:0:3/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:4/scsi_disk/5:0:0:4/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:5/scsi_disk/5:0:0:5/provisioning_mode
/sys/devices/platform/host5/session3/target5:0:0/5:0:0:6/scsi_disk/5:0:0:6/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:1/scsi_disk/6:0:0:1/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:2/scsi_disk/6:0:0:2/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:3/scsi_disk/6:0:0:3/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:4/scsi_disk/6:0:0:4/provisioning_mode
/sys/devices/platform/host6/session4/target6:0:0/6:0:0:5/scsi_disk/6:0:0:5/provisioning_mode

* Catting all these files produces:

[root@storage-ge1-vdsm3 log]# cat /sys/devices/platform/host*/session*/target*:0:0/*:0:0:*/scsi_disk/*:0:0:*/provisioning_mode
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16
writesame_16

[root@storage-ge1-vdsm3 log]# cat /sys/devices/pci0000:00/0000:00:04.0/virtio1/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode
unmap

> Thanks.

Comment 21 CongLi 2017-06-28 01:35:11 UTC
(In reply to Avihai from comment #20)
> Hi Cong, 
> 
> Do you mean on the VM itself or on the host ?
 
I mean the host.

> If you mean on VM than unfortunetly VM is paused so I can not connect to it
> & extract this info.
> 
> If you meant on the host(which the VM resides on) then this is what I get:
> 
> cat /sys/block/sd*/queue/discard_max_bytes
> 1073741824
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 8388608
> 
> 
> > cat /sys/devices/***/provisioning_mode
> 
> This did not work:
> [root@storage-ge1-vdsm3 log]# cat /sys/devices/***/provisioning_mode
> cat: /sys/devices/***/provisioning_mode: No such file or directory

Thanks Avihai.

I'm sorry for the confusion; I used my own shorthand to describe it.
Actually, '***' stands for your device path.
Like:
cat /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host11/rport-11:0-2/target11:0:0/11:0:0:1/scsi_disk/11:0:0:1/provisioning_mode
writesame_16


Could you please help confirm if this bug could be reproduced on a thick (non-thin provisioned) disk? 

I'm just thinking whether it's a thin provisioned disk problem.
FYI:
https://bugzilla.redhat.com/show_bug.cgi?id=1462594


Thanks.

Comment 22 Avihai 2017-06-28 05:28:15 UTC
(In reply to CongLi from comment #21)
> (In reply to Avihai from comment #20)
> > Hi Cong, 
> > 
> > Do you mean on the VM itself or on the host ?
>  
> I mean the host.
> 
> > If you mean on VM than unfortunetly VM is paused so I can not connect to it
> > & extract this info.
> > 
> > If you meant on the host(which the VM resides on) then this is what I get:
> > 
> > cat /sys/block/sd*/queue/discard_max_bytes
> > 1073741824
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 8388608
> > 
> > 
> > > cat /sys/devices/***/provisioning_mode
> > 
> > This did not work:
> > [root@storage-ge1-vdsm3 log]# cat /sys/devices/***/provisioning_mode
> > cat: /sys/devices/***/provisioning_mode: No such file or directory
> 
> Thanks Avihai.
> 
> I'm sorry to confuse you, I used my way to describe it.
> Actually, '***' means your device info.
> Like:
> cat
> /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/host11/rport-11:0-2/
> target11:0:0/11:0:0:1/scsi_disk/11:0:0:1/provisioning_mode
> writesame_16
> 
> 
> Could you please help confirm if this bug could be reproduced on a thick
> (non-thin provisioned) disk? 

This bug only occurs on a thin-provisioned disk.
I tried to reproduce on a RAW preallocated disk & the issue did not reproduce.
 
> I'm just thinking whether it's a thin provisioned disk problem.
> FYI:
> https://bugzilla.redhat.com/show_bug.cgi?id=1462594
> 
> 
> Thanks.

Please tell me if you need any more info.

Comment 23 CongLi 2017-06-28 10:16:30 UTC
(In reply to Avihai from comment #22)
> > Could you please help confirm if this bug could be reproduced on a thick
> > (non-thin provisioned) disk? 
> 
> This bug only occurs on a thin-provisioned disk.
> I tried to reproduce on a RAW prealocated disk & issue did not reproduce.
>  
> > I'm just thinking whether it's a thin provisioned disk problem.
> > FYI:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1462594
> > 
> > 
> > Thanks.
> 
> Please tell me if you need any more info.

Hi Avihai,

I confirmed with storage QE; he thinks it's similar to BZ1462594, but even the
developers are still investigating what the problem in the kernel is.

I think you could ask kernel storage team for more investigation.

Thanks.

Comment 24 Avihai 2017-06-28 11:52:15 UTC
(In reply to CongLi from comment #23)
> (In reply to Avihai from comment #22)
> > > Could you please help confirm if this bug could be reproduced on a thick
> > > (non-thin provisioned) disk? 
> > 
> > This bug only occurs on a thin-provisioned disk.
> > I tried to reproduce on a RAW prealocated disk & issue did not reproduce.
> >  
> > > I'm just thinking whether it's a thin provisioned disk problem.
> > > FYI:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1462594
> > > 
> > > 
> > > Thanks.
> > 
> > Please tell me if you need any more info.
> 
> Hi Avihai,
> 
> Confirmed with storage QE, he thinks it's similar to BZ1462594, but even the 
> developers are still investigating what is the problem on kernel.
> 
> I think you could ask kernel storage team for more investigation.
> 
> Thanks.

Hi Cong,

Can you please add a NeedINFO request for the kernel storage team focal point?
Or, if you tell me who that is, I'll do it.

Thanks

Comment 25 Yaniv Kaul 2017-06-28 12:04:09 UTC
Is your VM running with virtio-scsi and discard enabled? And what are you discarding, exactly?

Comment 26 Avihai 2017-06-28 12:44:05 UTC
(In reply to Yaniv Kaul from comment #25)
> Is your VM running with virtio-scsi?
No, disk is with virtio interface.

> and discard enabled?
I'm not sure what you mean, but if you are talking about discard after delete of the storage domain, then it is disabled.
 
> and what are you discarding exactly?
I'm not aware of intentionally wanting to discard anything during the test.

Comment 27 Yaniv Kaul 2017-06-28 13:16:35 UTC
(In reply to Avihai from comment #26)
> (In reply to Yaniv Kaul from comment #25)
> > Is your VM running with virtio-scsi?
> No, disk is with virtio interface.
> 
> > and discard enabled?
> I'm not sure what you mean, but if you are talking about discard after
> delete of the storage domain than it is disabled.
>  
> > and what are you discarding exactly?
> I'm not aware of intentionally wanting to discard anything during the test.

So the bug mentioned above ( https://bugzilla.redhat.com/show_bug.cgi?id=1462594 ) is completely irrelevant to this one.
And I'm still convinced it's the same as our known issue, which you can easily prove by changing the allocation size, for example.

Comment 28 Avihai 2017-06-28 13:32:30 UTC
(In reply to Yaniv Kaul from comment #27)
> (In reply to Avihai from comment #26)
> > (In reply to Yaniv Kaul from comment #25)
> > > Is your VM running with virtio-scsi?
> > No, disk is with virtio interface.
> > 
> > > and discard enabled?
> > I'm not sure what you mean, but if you are talking about discard after
> > delete of the storage domain than it is disabled.
> >  
> > > and what are you discarding exactly?
> > I'm not aware of intentionally wanting to discard anything during the test.
> 
> So the bug mentioned above (
> https://bugzilla.redhat.com/show_bug.cgi?id=1462594 ) is completely
> irrelevant to this one.
> And I'm still convinced it's the same as our known issue. Which you can
> easily prove by changing the allocation size for example.

Didn't we already prove this with the slow dd test?
Why do we need another test to prove that?
If the slow dd test is enough, please reassign accordingly.

If this test is really necessary, then please clarify what you mean by changing the allocation size, with a clear scenario, & I'll retest hoping to provide more clarity on this issue.

Comment 29 Yaniv Kaul 2017-06-28 14:09:06 UTC
(In reply to Avihai from comment #28)
> (In reply to Yaniv Kaul from comment #27)
> > (In reply to Avihai from comment #26)
> > > (In reply to Yaniv Kaul from comment #25)
> > > > Is your VM running with virtio-scsi?
> > > No, disk is with virtio interface.
> > > 
> > > > and discard enabled?
> > > I'm not sure what you mean, but if you are talking about discard after
> > > delete of the storage domain than it is disabled.
> > >  
> > > > and what are you discarding exactly?
> > > I'm not aware of intentionally wanting to discard anything during the test.
> > 
> > So the bug mentioned above (
> > https://bugzilla.redhat.com/show_bug.cgi?id=1462594 ) is completely
> > irrelevant to this one.
> > And I'm still convinced it's the same as our known issue. Which you can
> > easily prove by changing the allocation size for example.
> 
> Didn't we proove this already with the slow dd test ?
> Why do we need another test to proove that ?
> If slow dd test is enough please reassign accordigly .
> 
> If this test is really necessary then please clarify what you mean by
> changing allocation size with a clear scenario& I'll retest with hoping to
> provide more clarity on this issue.

*** This bug has been marked as a duplicate of bug 1408594 ***

Comment 30 Raz Tamir 2017-07-03 11:56:31 UTC
Comment from Avihai:
Hi All,

It looks like rhel7.4 indeed running higher IOPS is the root cause here.

The following test was done in the same ovirt-guest-agent on latest 4.2 with rhel7.4 hosts to confirm it.

Test Details:
Engine: 4.2.0-0.0.master.20170627181935.git9424f9b
VDSM:   4.20.1-89

1) We added the following parameters to /etc/vdsm/vdsm.conf, to start extending at 25% (instead of 50%) and to extend by 2G (instead of 1GB), & restarted VDSM (see the sketch after this list).

[vars]
vm_watermark_interval = 2


[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

2) Ran the same scenario as in the bug & the issue did not reproduce - this was done twice.

3) When we returned to the default parameters, the issue reproduced.
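
A sketch of applying the workaround above (assuming a systemd-managed vdsm, as on el7):

# after editing /etc/vdsm/vdsm.conf as shown in step 1
systemctl restart vdsmd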

Comment 31 Allon Mureinik 2017-07-06 11:34:35 UTC
Copy-and-paste bug - please ignore the previous bug marked as blocking this one.

Comment 32 Nir Soffer 2017-07-06 17:04:55 UTC
According to bug 1467826, thin disks are *never* extended on rhel 7.4. I believe
what we see in this bug is:

1. Create a 1g snapshot
2. Try to write 1.3g of data to the disk
3. The vm pauses

When you change ovirt defaults, using 
[irs]
volume_utilization_chunk_mb = 2048

We create a 2g disk, so copying 1.3g of data does not pause
the vm.

I don't see any data yet that supports the claim of higher io rate on 7.4.

When bug 1467826 is fixed, we should test this again.
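
The arithmetic behind that, as an illustration (sizes taken from this comment):

# with the default 1 GiB chunk the qcow2 LV starts at 1 GiB, so writing ~1.3 GiB hits ENOSPC
# unless an extend completes in time; with volume_utilization_chunk_mb = 2048 the LV is 2 GiB
echo $((2048 - 1300))   # ~748 MiB of headroom left, so the write never needs an extend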

Comment 33 Allon Mureinik 2017-07-10 12:12:45 UTC
Wrong window, my apologies. This is obviously critical for 4.1.4

Comment 34 Allon Mureinik 2017-07-27 12:12:58 UTC
Raz - based on the verification of the underlying issues, bug 1470127 and bug 1473145 - can this be set to VERIFIED?

Comment 35 Raz Tamir 2017-07-30 11:04:47 UTC
The bag is still here.

rhevm-4.1.5-0.1.el7.noarch
vdsm-4.19.25-1.el7ev.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64

engine.log:

2017-07-30 13:49:06,513+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED(1,025), Correlation ID: null, Call Stack: null, Custom Event 
ID: -1, Message: VM vm_TestCase5063_REST_ISCSI_3013412467 has been paused.
2017-07-30 13:49:06,630+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED_ENOSPC(138), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM vm_TestCase5063_REST_ISCSI_3013412467 has been paused due to no Storage space error.


vdsm.log:

2017-07-30 13:49:06,326+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,328+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onSuspend (vm:5048)
2017-07-30 13:49:06,328+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,329+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,337+0300 DEBUG (check/loop) [storage.asyncevent] Process <cpopen.CPopen object at 0x2f70c50> terminated (count=1) (asyncevent:565)
2017-07-30 13:49:06,338+0300 DEBUG (check/loop) [storage.check] FINISH check u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__local__ge5__nfs__1/ce1a3bf8-a91a-4f9e-afaf-a4a393e1c67e/dom_md/metadata' (rc=0, elapsed=0.05) (check:328)


* Logs attached

Comment 36 Raz Tamir 2017-07-30 11:05:51 UTC
Created attachment 1306556 [details]
new_logs

Comment 37 Raz Tamir 2017-07-30 11:08:18 UTC
(In reply to Raz Tamir from comment #35)
> The bag is still here.
The *bug* is still here.
> 
> rhevm-4.1.5-0.1.el7.noarch
> vdsm-4.19.25-1.el7ev.x86_64
> qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64
> 
> engine.log:
> 
> 2017-07-30 13:49:06,513+03 INFO 
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED(1,025), Correlation ID:
> null, Call Stack: null, Custom Event 
> ID: -1, Message: VM vm_TestCase5063_REST_ISCSI_3013412467 has been paused.
> 2017-07-30 13:49:06,630+03 ERROR
> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
> (ForkJoinPool-1-worker-9) [] EVENT_ID: VM_PAUSED_ENOSPC(138), Correlation
> ID: null, Call Stack: null, Custom Event ID: -1, Message: VM
> vm_TestCase5063_REST_ISCSI_3013412467 has been paused due to no Storage
> space error.
> 
> 
> vdsm.log:
> 
> 2017-07-30 13:49:06,326+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended
> (vm:4174)
> 2017-07-30 13:49:06,328+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onSuspend
> (vm:5048)
> 2017-07-30 13:49:06,328+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device
> virtio-disk1 error enospc (vm:4167)
> 2017-07-30 13:49:06,329+0300 INFO  (libvirt/events) [virt.vm]
> (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError
> (vm:5048)
> 2017-07-30 13:49:06,337+0300 DEBUG (check/loop) [storage.asyncevent] Process
> <cpopen.CPopen object at 0x2f70c50> terminated (count=1) (asyncevent:565)
> 2017-07-30 13:49:06,338+0300 DEBUG (check/loop) [storage.check] FINISH check
> u'/rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:
> _Storage__NFS_storage__local__ge5__nfs__1/ce1a3bf8-a91a-4f9e-afaf-
> a4a393e1c67e/dom_md/metadata' (rc=0, elapsed=0.05) (check:328)
> 
> 
> * Logs attached

Comment 38 Nir Soffer 2017-07-30 11:19:05 UTC
(In reply to Raz Tamir from comment #35)
> rhevm-4.1.5-0.1.el7.noarch
> vdsm-4.19.25-1.el7ev.x86_64
> qemu-kvm-rhev-2.9.0-16.el7_4.3.x86_64

Which libvirt version did you test? see bug 1467826

Comment 39 Raz Tamir 2017-07-30 11:30:07 UTC
I'm using libvirt-3.2.0-14.el7.x86_64 which is the same as bug #1470127 verified with.

Comment 40 Allon Mureinik 2017-07-30 11:34:06 UTC
(In reply to Raz Tamir from comment #39)
> I'm using libvirt-3.2.0-14.el7.x86_64 which is the same as bug #1470127
> verified with.
Is this a copy-paste bug (i.e., are you missing the last part)? Since that bug was verified with libvirt-3.2.0-14.el7_4.2

Comment 41 Raz Tamir 2017-07-30 11:53:52 UTC
No,
This is the version on my host.

We will wait for verification for the correct libvirt that should land soon

Keep this bug on NEW for now

Comment 42 Nir Soffer 2017-07-30 12:02:18 UTC
To test this bug, we need to compare the behavior on 7.3 and 7.4. Pausing during
the extend flow is expected; we want to know whether on 7.4 we have more pauses compared
with 7.3.

Also testing should be done with master or with the attached patch, improving 
logging for extend flow:
https://gerrit.ovirt.org/79969

Keeping the needinfo until we have test results from both 7.3 and 7.4.

Comment 43 Nir Soffer 2017-07-30 16:33:55 UTC
Avihai, this test contains a thin disk extend before writing IO to the disk. Do
you have another test that writes to the disk without extending it?

For example:

1. Create thin disk of 2g
2. Attach disk to vm
3. Create file system and mount the disk
4. Copy 1.3g of data into new disk

Does this fail, or does it fail only when we extend the disk before the test?

Comment 44 Nir Soffer 2017-07-30 16:44:30 UTC
Looking at the logs from attachment 1306556 [details]:

1. We don't have a log for the extend operation, only for its completion:

2017-07-30 13:43:35,728+0300 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.diskSizeExtend succeeded in 0.15 seconds (__init__:539)

(I don't have any information about when the disk copy started; please include this
information in the next test.)

2. The vm paused

2017-07-30 13:49:06,226+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,227+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)

3. We checked the drives, and no drive needs extension:

2017-07-30 13:49:06,326+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)

For some reason libvirt sends more events about this...

2017-07-30 13:49:06,328+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onSuspend (vm:5048)
2017-07-30 13:49:06,328+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,329+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,343+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,344+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,345+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,357+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,358+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,358+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,370+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,371+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,372+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,384+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,385+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,386+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,398+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,399+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,399+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,411+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,412+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,413+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,425+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,426+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,426+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,438+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)
2017-07-30 13:49:06,439+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') abnormal vm stop device virtio-disk1 error enospc (vm:4167)
2017-07-30 13:49:06,440+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') CPU stopped: onIOError (vm:5048)
2017-07-30 13:49:06,452+0300 INFO  (libvirt/events) [virt.vm] (vmId='9549916b-3ac2-4f1f-bb65-21aeafa6a083') No VM drives were extended (vm:4174)

This means that we did not get a correct allocation value from libvirt.

We need to check whether this happens only after resizing an online disk, since
extension works in other flows, e.g. provisioning a vm.
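
One way to cross-check what libvirt itself reports for the drive, independent of vdsm (a sketch; the domain name and target device are placeholders):

virsh domblkinfo <domain> <target>   # prints Capacity / Allocation / Physical, the values logged by vdsm above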

Comment 45 Nir Soffer 2017-07-30 19:25:40 UTC
I tried to reproduce with:

libvirt-*-3.2.0-15.el7_rc.9a49cd47ef.x86_64 (scratch build for bug 1467826)
qemu-*-2.9.0-16.el7_4.2.skip_implicit.x86_64 (scratch build for bug 1470634)

I could not reproduce the issue, and I see correct allocation values from
libvirt, and correct operation in vdsm.

Tested with https://gerrit.ovirt.org/79015, rebased on top of
https://gerrit.ovirt.org/79974 for improved logging.

1. Starting io inside the vm:

I logged a marker into the vdsm log to make it easy to understand the flow:
# echo "`date +'%Y-%m-%d %H:%M:%S,%N'` starting io inside the vm to /dev/sdb1" >> /var/log/vdsm/vdsm.log

2017-07-30 20:50:01,945536470 starting io inside the vm to /dev/sdb1

2. We get correct allocation from libvirt

2017-07-30 20:50:04,861+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=174194688, physical=1073741824) (vm:1073)

2017-07-30 20:50:06,862+0300 DEBUG (periodic/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=364380160, physical=1073741824) (vm:1073)

3. We detect that the disk needs extension

2017-07-30 20:50:08,862+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=578945024, physical=1073741824) (vm:1073)

2017-07-30 20:50:08,863+0300 INFO  (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting extension for volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c on domain 6ffbc483-0031-403a-819b-3bb2f0f8de0a (apparent: 1073741824, capacity: 2147483648, allocated: 578945024, physical: 1073741824) (vm:1135)

2017-07-30 20:50:08,863+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting an extension for the volume: {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdc', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518770.25, 'internal': False, 'volumeID': u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c', 'imageID': u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee'} (vm:1212)

2017-07-30 20:50:08,863+0300 INFO  (periodic/0) [vdsm.api] START sendExtendMsg(spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', volDict={'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdc', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518770.25, 'internal': False, 'volumeID': u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c', 'imageID': u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee'}, newSize=2147483648, callbackFunc=<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>) from=internal, task_id=66f7b4d1-0c64-4b1a-8ca9-68220c468c0c (api:46)

2017-07-30 20:50:08,863+0300 INFO  (periodic/0) [vdsm.api] FINISH sendExtendMsg return=None from=internal, task_id=66f7b4d1-0c64-4b1a-8ca9-68220c468c0c (api:52)

2017-07-30 20:50:08,876+0300 INFO  (mailbox-hsm) [storage.MailBox.HsmMailMonitor] HSM_MailMonitor sending mail to SPM - ['/usr/bin/dd', 'of=/rhev/data-center/6c99f4e5-8588-46f5-a818-e11151c1d19c/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=4096', 'count=1', 'seek=1'] (mailbox:387)

4. The VM disk continues to fill up...

2017-07-30 20:50:10,868+0300 DEBUG (periodic/3) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=732037120, physical=1073741824) (vm:1073)

5. Got the extend completed message from the SPM; the logical volume was extended
   on the SPM, but this host does not see the new size yet...

2017-07-30 20:50:10,921+0300 INFO  (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 198a9ea5-e9f5-43e7-8cb5-5a8ce04a1c92 (cmd=<function runTask at 0x276f758>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>, {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdc', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518770.25, 'internal': False, 'volumeID': u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c', 'imageID': u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee'})) (threadPool:208)

6. Refreshing the logical volume

2017-07-30 20:50:10,923+0300 INFO  (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID=u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', imgUUID=u'2c831846-5c5e-41cc-8f29-ea4b5aaf0dee', volUUID=u'943fd0d7-af8b-4257-85d4-e7df39dc3b7c') from=internal, task_id=b6a4b66a-3363-408a-b1db-68fe8c4d5f0b (api:46)

7. Refresh started; the VM does not see the new size yet (physical=1073741824)

2017-07-30 20:50:12,861+0300 DEBUG (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=732037120, physical=1073741824) (vm:1073)

8. The refresh has not completed yet, but the VM already sees the new size (physical=2147483648)

2017-07-30 20:50:14,859+0300 DEBUG (periodic/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdc volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c (capacity=2147483648, allocated=747765760, physical=2147483648) (vm:1073)

9. Refresh volume completed

2017-07-30 20:50:15,141+0300 INFO  (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=b6a4b66a-3363-408a-b1db-68fe8c4d5f0b (api:52)

2017-07-30 20:50:15,142+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extend volume 943fd0d7-af8b-4257-85d4-e7df39dc3b7c completed in 6.28 seconds (vm:1175)
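
As a side note, the decision in step 3 above reduces to a watermark check on
the values libvirt reports. A minimal sketch, not the actual vdsm code; the
names, the 512 MiB threshold and the 1 GiB chunk are illustrative assumptions:

    # Illustrative watermark check; the real vdsm logic is configuration driven,
    # this only mirrors the numbers logged above.
    GiB = 1024 ** 3
    WATERMARK = 512 * 2 ** 20     # assumed free-space threshold
    CHUNK = 1 * GiB               # assumed extension chunk size

    def next_physical_size(capacity, allocated, physical):
        """Return the physical size to request, or None if no extend is needed."""
        free = physical - allocated
        if free >= WATERMARK:
            return None
        # never grow the logical volume past the virtual size of the qcow2 volume
        return min(physical + CHUNK, capacity)

    # Values from the request above: capacity=2 GiB, allocated=578945024, physical=1 GiB
    print(next_physical_size(2 * GiB, 578945024, 1 * GiB))   # -> 2147483648, the logged newSize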


Here is another run, where the extend took 3.4 seconds:

1. Detecting that the drive needs an extend

2017-07-30 20:38:34,748+0300 DEBUG (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive vdb volume 77406f37-729a-420c-85f9-0b3da825236b (capacity=2147483648, allocated=585236480, physical=1073741824) (vm:1073)

2. Sending extend message

2017-07-30 20:38:34,748+0300 INFO  (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting extension for volume 77406f37-729a-420c-85f9-0b3da825236b on domain 6ffbc483-0031-403a-819b-3bb2f0f8de0a (apparent: 1073741824, capacity: 2147483648, allocated: 585236480, physical: 1073741824) (vm:1135)

3. Allocation continues to grow...

2017-07-30 20:38:36,753+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive vdb volume 77406f37-729a-420c-85f9-0b3da825236b (capacity=2147483648, allocated=703725568, physical=1073741824) (vm:1073)

4. Got the extend completed message from the SPM

2017-07-30 20:38:37,022+0300 INFO  (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 399f1dda-5c32-4ab7-ab1e-73eb1ad5e283 (cmd=<function runTask at 0x276f758>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>, {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'vdb', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6518076.13, 'internal': False, 'volumeID': u'77406f37-729a-420c-85f9-0b3da825236b', 'imageID': u'861f1134-1b98-4d93-9f2c-af43ed8b33a1'})) (threadPool:208)

5. Refreshing the drive...

2017-07-30 20:38:37,024+0300 INFO  (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID=u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', imgUUID=u'86

2017-07-30 20:38:37,025+0300 INFO  (mailbox-hsm/2) [storage.LVM] Refreshing LVs (vg=6ffbc483-0031-403a-819b-3bb2f0f8de0a, lvs=(u'77406f37-729a-420c-85f9-0b3da825236b',)) (lvm:1322)

2017-07-30 20:38:38,098+0300 INFO  (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=e03becf8-9957-43a2-ba4d-35e5f100658e (api:52)

6. Extend completed

2017-07-30 20:38:38,099+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extend volume 77406f37-729a-420c-85f9-0b3da825236b completed in 3.36 seconds (vm:1175)


And another run, where the extend took 9.8 seconds:

1. Detecting that the drive needs an extend

2017-07-30 20:17:02,617+0300 DEBUG (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=584187904, physical=1073741824) (vm:1073)

2. Requesting extension

2017-07-30 20:17:02,617+0300 INFO  (periodic/0) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Requesting extension for volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 on domain 6ffbc483-0031-403a-819b-3bb2f0f8de0a (apparent: 1073741824, capacity: 2147483648, allocated: 584187904, physical: 1073741824) (vm:1135)

3. Allocation continues to grow...

2017-07-30 20:17:04,616+0300 DEBUG (periodic/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=784596992, physical=1073741824) (vm:1073)

4. Got the extend completed message from the SPM

2017-07-30 20:17:04,679+0300 INFO  (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 2e16b2cf-f590-414e-a9d5-720e4bb496e6 (cmd=<function runTask at 0x276f758>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x36d2610>>, {'newSize': 2147483648, 'domainID': u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', 'name': 'sdb', 'poolID': u'6c99f4e5-8588-46f5-a818-e11151c1d19c', 'started': 6516784.0, 'internal': False, 'volumeID': u'1befe02c-d021-47e8-b4e7-62dcdd6503d6', 'imageID': u'508c3cde-4178-434d-b9ba-22b2053bfd7e'})) (threadPool:208)

5. Refreshing logical volume

2017-07-30 20:17:04,680+0300 INFO  (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID=u'6ffbc483-0031-403a-819b-3bb2f0f8de0a', spUUID=u'6c99f4e5-8588-46f5-a818-e11151c1d19c', imgUUID=u'508c3cde-4178-434d-b9ba-22b2053bfd7e', volUUID=u'1befe02c-d021-47e8-b4e7-62dcdd6503d6') from=internal, task_id=de92fe0e-a5c7-4a10-9272-fc4495468e6d (api:46)

6. The VM does not see the new size yet...

2017-07-30 20:17:08,621+0300 DEBUG (periodic/1) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=900296704, physical=1073741824) (vm:1073)

7. The VM sees the new size

2017-07-30 20:17:10,617+0300 DEBUG (periodic/3) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extension info for drive sdb volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 (capacity=2147483648, allocated=1029173248, physical=2147483648) (vm:1073)

8. Refresh completed

2017-07-30 20:17:12,441+0300 INFO  (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=de92fe0e-a5c7-4a10-9272-fc4495468e6d (api:52)

2017-07-30 20:17:12,442+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='b9751033-8db8-4b1d-9b1d-779b38a5315f') Extend volume 1befe02c-d021-47e8-b4e7-62dcdd6503d6 completed in 9.83 seconds (vm:1175)

The longest part of the extend flow is the LVM refresh operation:
- extend: 2 seconds
- refresh: 7.8 seconds
  (the VM sees the new size after 6 seconds)

We need better logging showing how much time each part of the flow took,
so we can collect more data.


I tested with rather slow storage, giving up to a 70 MiB/s write rate inside the
VM. It is possible that with faster storage the VM would pause while copying,
but this is not new.

Comment 46 Nir Soffer 2017-07-30 20:05:18 UTC
Avihay, please test again with master, using this patch:
https://gerrit.ovirt.org/79015

Please set "virt" logger log level to DEBUG:

    vdsm-client Host setLogLevel name=virt level=DEBUG

You will get a log line with extension info every 2 seconds for each drive:

2017-07-30 22:49:53,664+0300 DEBUG (periodic/0) [virt.vm] (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extension info for drive sdb volume 5aa55005-2922-41ed-8f2a-3072d0d86f02 (capacity=2147483648, allocated=641925120, physical=1073741824) (vm:1073)

When a volume extension completes, we will get this log line:

2017-07-30 22:49:55,609+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extend volume 5aa55005-2922-41ed-8f2a-3072d0d86f02 completed in 3.95 seconds (extend=2.12 seconds, refresh=1.83 seconds) (vm:1187)

It would be helpful to run only this test in automation at least 10 times with
7.3 and 7.4.
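
To aggregate the results of those runs, a rough helper sketch (assuming the
completion-line format shown above; the log path is a placeholder):

    # Sketch only: summarize "Extend volume ... completed in X seconds
    # (extend=Y seconds, refresh=Z seconds)" lines from a vdsm log.
    import re

    PATTERN = re.compile(
        r'completed in ([\d.]+) seconds '
        r'\(extend=([\d.]+) seconds, refresh=([\d.]+) seconds\)')

    def summarize(path='/var/log/vdsm/vdsm.log'):
        totals, extends, refreshes = [], [], []
        with open(path) as log:
            for line in log:
                m = PATTERN.search(line)
                if m:
                    totals.append(float(m.group(1)))
                    extends.append(float(m.group(2)))
                    refreshes.append(float(m.group(3)))
        for name, values in (('total', totals), ('extend', extends), ('refresh', refreshes)):
            if values:
                print('%s: n=%d min=%.2f max=%.2f avg=%.2f' % (
                    name, len(values), min(values), max(values), sum(values) / len(values)))

    summarize()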

Comment 47 Avihai 2017-07-31 05:10:31 UTC
(In reply to Nir Soffer from comment #43)
> Avihay, this test contains a thin disk extend before writing io to disk.
No, we do the extend BEFORE writing to the disk.

Steps to Reproduce:
1. Clone a VM from a template with the OS disk on an iSCSI/FC storage domain.
2. Create an additional thin disk on iSCSI/FC with:
   size = 1G
   format = QCOW
   sparse = True

3. Extend the created disk by an additional 1G (new total size 2G).
4. Create a partition + FS (ext4) on the new disk.
5. Write IO via dd to the newly created FS (dd size = disk size - 600M ~= 1358M); for example, see the dd command in the bug description.


> Do you have another test that write to disk without extending it?
No, but if it helps I can modify the current test and check it out.

> For example:
> 
> 1. Create thin disk of 2g
> 2. Attach disk to vm
> 3. Create file system and mount the disk
> 4. Copy 1.3g of data into new disk



> Does this fail, or only when we extend the disk before the test it fails?
Will test  & update.

Comment 48 Avihai 2017-07-31 05:58:02 UTC
(In reply to Avihai from comment #47)
> (In reply to Nir Soffer from comment #43)
> > Avihay, this test contains a thin disk extend before writing io to disk.
> No , we do the extend BEFORE writing to disk .
> 
> Steps to Reproduce:
> 1.Clone VM from template with OS from iscsi/FC storage domain.
> 2.create additional thin disk from iscsi/FC with:
> size = 1G
> format = QCOW
> sparse = True
> 
> 3.Extend the created disk with an additonal 1G ( New total size 2G)
> 4.Create partition + FS (ext4) on new disk
> 5.Write IO via dd to the newly created FS (dd size = Disk size - 600M ~=
> 1358M), for example in the local run I saw:
> 
> 
> > Do you have another test that write to disk without extending it?
> No but if it help I can modify current test & check it out . 
> 
> > For example:
> > 
> > 1. Create thin disk of 2g
> > 2. Attach disk to vm
> > 3. Create file system and mount the disk
> > 4. Copy 1.3g of data into new disk
> 
> 
> 
> > Does this fail, or only when we extend the disk before the test it fails?
> Will test  & update.

The issue (VM pause) still occurs without the extend in your suggested scenario.



Engine:
2017-07-31 08:48:46,897+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-7) [] EVENT_ID: VM_PAUSED_ENOSPC(138), VM vm_TestCase5063_REST_ISCSI_3108445080 has been paused due to no Storage space error.

VDSM:
2017-07-31 08:48:46,644+0300 INFO  (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') abnormal vm stop device virtio-disk1 error enospc (vm:4501)
2017-07-31 08:48:46,645+0300 INFO  (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') CPU stopped: onIOError (vm:5431)
2017-07-31 08:48:46,655+0300 INFO  (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') No VM drives were extended (vm:4508)
2017-07-31 08:48:46,656+0300 INFO  (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') CPU stopped: onSuspend (vm:5431)
2017-07-31 08:48:46,657+0300 INFO  (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') abnormal vm stop device virtio-disk1 error eother (vm:4501)
2017-07-31 08:48:46,657+0300 INFO  (libvirt/events) [virt.vm] (vmId='8a38814f-a31b-4ffd-9644-35aa691be639') CPU stopped: onIOError (vm:5431)

Logs attached.

Comment 49 Avihai 2017-07-31 06:00:44 UTC
Created attachment 1306809 [details]
Test_without_extend_logs_4.2.0-0.0.master.2017072314102

Added logs for the suggested scenario without the extend.

Engine:
4.2.0-0.0.master.20170723141021.git463826a.el7.centos

VDSM:
4.20.1-269

Comment 50 Avihai 2017-07-31 11:39:15 UTC
(In reply to Nir Soffer from comment #46)
> Avihay, please test again with master, using this patch:
> https://gerrit.ovirt.org/79015
> 
> Please set "virt" logger log level to DEBUG:
> 
>     vdsm-client Host setLogLevel name=virt level=DEBUG
> 
> You will have a log with extension info every 2 seconds for each drive:
> 
> 2017-07-30 22:49:53,664+0300 DEBUG (periodic/0) [virt.vm]
> (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extension info for drive sdb
> volume 5aa55005-2922-41ed-8f2a-3072d0d86f02 (capacity=2147483648,
> allocated=641925120, physical=1073741824) (vm:1073)
> 
> When extending a volume completes, we will have this log:
> 
> 2017-07-30 22:49:55,609+0300 INFO  (mailbox-hsm/2) [virt.vm]
> (vmId='73bb2a51-94a6-493e-a967-1969a0c89268') Extend volume
> 5aa55005-2922-41ed-8f2a-3072d0d86f02 completed in 3.95 seconds (extend=2.12
> seconds, refresh=1.83 seconds) (vm:1187)
> 
> It would be helpful to run only this test in automation at least 10 times
> with
> 7.3 and 7.4.

Nir, I ran the same scenario as in the original issue with this patch, twice, on 7.4 with virt DEBUG enabled, and the issue reproduced on both runs.

See the attached logs of the 7.4 run with DEBUG enabled.

I do not have a 7.3 host setup at the moment - is this critical?

Comment 51 Avihai 2017-07-31 11:43:13 UTC
Created attachment 1307036 [details]
engine & vdsm logs no patch 79015 with virt DEBUG enabled

Comment 52 Nir Soffer 2017-07-31 12:54:56 UTC
(In reply to Avihai from comment #50)
> Nir , I ran same scenario as in original issue on this patch twice on 7.4
> with virt DEBUG enabled & issue reporoduce on both runs.

Thanks, will look at the logs.

> I do not have a 7.3 hosts setup at the moment - is this critical?

It is, so we can tell whether we have a regression in 7.4.

Comment 53 Nir Soffer 2017-07-31 12:58:58 UTC
(In reply to Avihai from comment #51)
> Created attachment 1307036 [details]
> engine & vdsm logs no patch 79015 with virt DEBUG enabled

Avihay, I need to know when the IO was started - the best way is to include it in
the vdsm log like I did in comment 45, or to provide the time separately.

Comment 54 Nir Soffer 2017-07-31 13:03:17 UTC
Avihay, you are using a broken version of libvirt in your reproduction:

$ grep 'Extension info for drive vdb' vdsm.log | grep -v allocated=0

No output

$ grep 'Extension info for drive vdb' vdsm.log | head
2017-07-31 14:29:40,768+0300 DEBUG (periodic/3) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:42,758+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:44,755+0300 DEBUG (periodic/0) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:46,762+0300 DEBUG (periodic/2) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:48,756+0300 DEBUG (periodic/3) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:51,242+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=1073741824, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:52,750+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:54,748+0300 DEBUG (periodic/3) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:56,823+0300 DEBUG (periodic/1) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)
2017-07-31 14:29:58,782+0300 DEBUG (periodic/0) [virt.vm] (vmId='7afdaca9-140c-48d7-b4be-4cd5883335de') Extension info for drive vdb volume 72a65baa-f204-440b-80ed-2250e6d80eaf (capacity=2147483648, allocated=0, physical=1073741824) (vm:1073)

If libvirt does not tell us about the allocation, we cannot extend.
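
A small sketch for spotting this in a log (assuming the "Extension info" format
above): it flags drives whose reported allocation never moves off 0, which is
the signature of the broken libvirt build.

    # Sketch only: flag drives whose allocated= value never changes from 0.
    import re
    from collections import defaultdict

    PATTERN = re.compile(r'Extension info for drive (\S+) .*allocated=(\d+)')

    def stuck_drives(path='vdsm.log'):
        max_alloc = defaultdict(int)
        with open(path) as log:
            for line in log:
                m = PATTERN.search(line)
                if m:
                    drive, alloc = m.group(1), int(m.group(2))
                    max_alloc[drive] = max(max_alloc[drive], alloc)
        return sorted(d for d, a in max_alloc.items() if a == 0)

    print(stuck_drives())   # e.g. ['vdb'] for the log quoted above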

Comment 58 Avihai 2017-07-31 15:20:47 UTC
Created attachment 1307188 [details]
logs with new libvirt

Added logs for a passing test.

The dd started at 2017-07-31 18:11:28.

Comment 59 Nir Soffer 2017-07-31 15:39:41 UTC
Avihay, looking at the logs from attachment 1307188 [details], everything looks OK.
This also means that the previous tests are invalid; we used the wrong libvirt version.

1. We detect that a drive should be extended

2017-07-31 18:11:41,316+0300 DEBUG (periodic/2) [virt.vm] (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Requesting an extension for the volume: {'newSize': 2147483648, 'domainID': '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID': '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305203.34, 'internal': False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID': '8e7a842c-993b-47a6-978c-c3a3de652a42'} (vm:1224)

2. We get extend completed message from spm

2017-07-31 18:11:41,395+0300 INFO  (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 5fd68ec2-dd7b-4536-88f6-0d04dac32a9f (cmd=<function runTask at 0x2e9d1b8>, args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x3fe6410>>, {'newSize': 2147483648, 'domainID': '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID': '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305201.31, 'internal': False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID': '8e7a842c-993b-47a6-978c-c3a3de652a42'})) (threadPool:208)

3. Refreshing volume

2017-07-31 18:11:41,398+0300 INFO  (mailbox-hsm/2) [vdsm.api] START refreshVolume(sdUUID='84b5768a-ca74-4ce7-ada3-7961f43f69f1', spUUID='8a8b6a4a-3123-4785-ae56-ccc4339493af', imgUUID='8e7a842c-993b-47a6-978c-c3a3de652a42', volUUID='920aaefd-1aa1-40d2-bdaa-08c299657acc') from=internal, task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:46)

2017-07-31 18:11:41,623+0300 INFO  (mailbox-hsm/2) [storage.LVM] Refreshing LVs (vg=84b5768a-ca74-4ce7-ada3-7961f43f69f1, lvs=('920aaefd-1aa1-40d2-bdaa-08c299657acc',)) (lvm:1322)

2017-07-31 18:11:41,904+0300 INFO  (mailbox-hsm/2) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:52)

4. Extend completed

2017-07-31 18:11:41,905+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Extend volume 920aaefd-1aa1-40d2-bdaa-08c299657acc completed in 2.61 seconds (extend=2.11 seconds, refresh=0.50 seconds) (vm:1187)

Can you run this test at least 10 times on the same setup, so we can compare
failure rates with 7.3?

Comment 60 Nir Soffer 2017-07-31 21:42:55 UTC
Since we don't yet have any real failure with the current libvirt, I think we should move this to 4.1.5.

Comment 61 Avihai 2017-08-01 03:36:56 UTC
(In reply to Nir Soffer from comment #59)
> Avihay, looking at logs from attachment 1307188 [details], everything looks
> ok. This means also
> that previous tests are invalid, we used the wrong libvirt version.
> 
> 1. We detect that a drive should be extended
> 
> 2017-07-31 18:11:41,316+0300 DEBUG (periodic/2) [virt.vm]
> (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Requesting an extension for
> the volume: {'newSize': 2147483648, 'domainID':
> '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID':
> '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305203.34, 'internal':
> False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID':
> '8e7a842c-993b-47a6-978c-c3a3de652a42'} (vm:1224)
> 
> 2. We get extend completed message from spm
> 
> 2017-07-31 18:11:41,395+0300 INFO  (mailbox-hsm/2)
> [storage.ThreadPool.WorkerThread] START task
> 5fd68ec2-dd7b-4536-88f6-0d04dac32a9f (cmd=<function runTask at 0x2e9d1b8>,
> args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at
> 0x3fe6410>>, {'newSize': 2147483648, 'domainID':
> '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID':
> '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305201.31, 'internal':
> False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID':
> '8e7a842c-993b-47a6-978c-c3a3de652a42'})) (threadPool:208)
> 
> 3. Refreshing volume
> 
> 2017-07-31 18:11:41,398+0300 INFO  (mailbox-hsm/2) [vdsm.api] START
> refreshVolume(sdUUID='84b5768a-ca74-4ce7-ada3-7961f43f69f1',
> spUUID='8a8b6a4a-3123-4785-ae56-ccc4339493af',
> imgUUID='8e7a842c-993b-47a6-978c-c3a3de652a42',
> volUUID='920aaefd-1aa1-40d2-bdaa-08c299657acc') from=internal,
> task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:46)
> 
> 2017-07-31 18:11:41,623+0300 INFO  (mailbox-hsm/2) [storage.LVM] Refreshing
> LVs (vg=84b5768a-ca74-4ce7-ada3-7961f43f69f1,
> lvs=('920aaefd-1aa1-40d2-bdaa-08c299657acc',)) (lvm:1322)
> 
> 2017-07-31 18:11:41,904+0300 INFO  (mailbox-hsm/2) [vdsm.api] FINISH
> refreshVolume return=None from=internal,
> task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:52)
> 
> 4. Extend completed
> 
> 2017-07-31 18:11:41,905+0300 INFO  (mailbox-hsm/2) [virt.vm]
> (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Extend volume
> 920aaefd-1aa1-40d2-bdaa-08c299657acc completed in 2.61 seconds (extend=2.11
> seconds, refresh=0.50 seconds) (vm:1187)
> 
> Can you run this test at least 10 times on the same setup, so we can compare
> failures rates with 7.3?

1) The current setup is on 4.2 + new libvirt + your patch (https://gerrit.ovirt.org/79015) -> do you want 10 runs on this setup?

2) Do you have the 7.3 rates for comparison, or are 10 additional runs needed there as well (take into consideration that we currently do not have a RHEL 7.3 stand)?

Comment 66 Avihai 2017-08-01 11:05:31 UTC
(In reply to Nir Soffer from comment #59)
> Avihay, looking at logs from attachment 1307188 [details], everything looks
> ok. This means also
> that previous tests are invalid, we used the wrong libvirt version.
> 
> 1. We detect that a drive should be extended
> 
> 2017-07-31 18:11:41,316+0300 DEBUG (periodic/2) [virt.vm]
> (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Requesting an extension for
> the volume: {'newSize': 2147483648, 'domainID':
> '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID':
> '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305203.34, 'internal':
> False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID':
> '8e7a842c-993b-47a6-978c-c3a3de652a42'} (vm:1224)
> 
> 2. We get extend completed message from spm
> 
> 2017-07-31 18:11:41,395+0300 INFO  (mailbox-hsm/2)
> [storage.ThreadPool.WorkerThread] START task
> 5fd68ec2-dd7b-4536-88f6-0d04dac32a9f (cmd=<function runTask at 0x2e9d1b8>,
> args=(<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at
> 0x3fe6410>>, {'newSize': 2147483648, 'domainID':
> '84b5768a-ca74-4ce7-ada3-7961f43f69f1', 'name': 'vdb', 'poolID':
> '8a8b6a4a-3123-4785-ae56-ccc4339493af', 'started': 4305201.31, 'internal':
> False, 'volumeID': '920aaefd-1aa1-40d2-bdaa-08c299657acc', 'imageID':
> '8e7a842c-993b-47a6-978c-c3a3de652a42'})) (threadPool:208)
> 
> 3. Refreshing volume
> 
> 2017-07-31 18:11:41,398+0300 INFO  (mailbox-hsm/2) [vdsm.api] START
> refreshVolume(sdUUID='84b5768a-ca74-4ce7-ada3-7961f43f69f1',
> spUUID='8a8b6a4a-3123-4785-ae56-ccc4339493af',
> imgUUID='8e7a842c-993b-47a6-978c-c3a3de652a42',
> volUUID='920aaefd-1aa1-40d2-bdaa-08c299657acc') from=internal,
> task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:46)
> 
> 2017-07-31 18:11:41,623+0300 INFO  (mailbox-hsm/2) [storage.LVM] Refreshing
> LVs (vg=84b5768a-ca74-4ce7-ada3-7961f43f69f1,
> lvs=('920aaefd-1aa1-40d2-bdaa-08c299657acc',)) (lvm:1322)
> 
> 2017-07-31 18:11:41,904+0300 INFO  (mailbox-hsm/2) [vdsm.api] FINISH
> refreshVolume return=None from=internal,
> task_id=ae8953f8-1c8f-4dc0-8fae-0100340aafef (api:52)
> 
> 4. Extend completed
> 
> 2017-07-31 18:11:41,905+0300 INFO  (mailbox-hsm/2) [virt.vm]
> (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Extend volume
> 920aaefd-1aa1-40d2-bdaa-08c299657acc completed in 2.61 seconds (extend=2.11
> seconds, refresh=0.50 seconds) (vm:1187)
> 
> Can you run this test at least 10 times on the same setup, so we can compare
> failures rates with 7.3?

I ran the test 10 times on the same setup (RHEL 7.4, new libvirt, your patch).
Logs are attached.

Comment 67 Avihai 2017-08-01 11:07:53 UTC
Created attachment 1307513 [details]
10 runs logs on RHEL7_4 with new libvirt & patch

Comment 68 Nir Soffer 2017-08-02 00:27:54 UTC
Avihay, did you reproduce the issue? The previous build seems to be fine.

Comment 69 Nir Soffer 2017-08-02 00:51:55 UTC
Looking at the logs from attachment 1307513 [details]:

$ grep "Extend volume" vdsm.log 
2017-07-31 18:11:41,905+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='fffdf91e-640d-4a0c-9376-391eb877cc1f') Extend volume 920aaefd-1aa1-40d2-bdaa-08c299657acc completed in 2.61 seconds (extend=2.11 seconds, refresh=0.50 seconds) (vm:1187)
2017-07-31 18:19:02,305+0300 INFO  (mailbox-hsm/4) [virt.vm] (vmId='f9e0b544-4c50-45a8-9ef1-f899e927eba7') Extend volume e7c20240-9045-4d13-b74a-7872daf18f5a completed in 2.72 seconds (extend=2.14 seconds, refresh=0.58 seconds) (vm:1187)
2017-07-31 18:30:00,511+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='93dfee3f-36fd-4f5e-b619-85a8b8a76171') Extend volume 55931609-dadb-4a9e-a8b4-d0acddaff678 completed in 2.66 seconds (extend=2.09 seconds, refresh=0.57 seconds) (vm:1187)
2017-08-01 11:51:06,012+0300 INFO  (mailbox-hsm/1) [virt.vm] (vmId='a242043a-96ce-4224-a5f4-8ec486aa6bfb') Extend volume c26196cf-420d-47b1-afbc-1eed99128e09 completed in 2.49 seconds (extend=2.10 seconds, refresh=0.39 seconds) (vm:1187)
2017-08-01 11:55:28,238+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='e06d0f3c-fb61-48b6-afe3-cd050c37694f') Extend volume 0b17fd30-33b1-46c9-875d-a80e4744cfc5 completed in 4.73 seconds (extend=4.12 seconds, refresh=0.61 seconds) (vm:1187)
2017-08-01 12:13:34,515+0300 INFO  (mailbox-hsm/2) [virt.vm] (vmId='8add541b-020a-471f-9e80-26ae03453bf5') Extend volume 373cb281-6d28-4858-9e9e-6df9240be0ce completed in 2.64 seconds (extend=2.10 seconds, refresh=0.54 seconds) (vm:1187)
2017-08-01 12:36:20,938+0300 INFO  (mailbox-hsm/1) [virt.vm] (vmId='195b8335-d352-4063-99b9-bf87fd765e63') Extend volume e4631228-7ed9-449b-b56a-ecb6fd34dbe5 completed in 2.58 seconds (extend=2.08 seconds, refresh=0.50 seconds) (vm:1187)
2017-08-01 12:45:27,140+0300 INFO  (mailbox-hsm/1) [virt.vm] (vmId='ba4d8d03-46b1-4335-96d7-69aa6971fc9e') Extend volume 2f6ec7e1-15f2-43fe-9c78-3617caff74cc completed in 2.58 seconds (extend=2.13 seconds, refresh=0.45 seconds) (vm:1187)

We had 8 extensions - maybe the test ran only 8 times instead of 10,
or maybe the vdsm log is partial?

All extends look similar; one was longer. This is expected since we
check for messages every 2 seconds: if the SPM sent a message just after
we checked, we will see it 2 seconds later.

Refresh times look short; I don't see the random, long refresh times
that I saw on my host.

$ egrep "vm stop device .+ error enospc" vdsm.log
(no output)

Avihay, I don't see any pause; I think we can safely close this bug.
The pauses you reported were a result of the broken libvirt, as explained
in comment 32.

Comment 70 Avihai 2017-08-02 05:14:49 UTC
(In reply to Nir Soffer from comment #68)
> Avihay, did you reproduce the issue? the previous build seems to be fine.

No, the issue did not reproduce in any of the 10+ runs using the new libvirt.

Comment 71 Avihai 2017-08-02 11:20:01 UTC
Verified.

Engine:
ovirt-engine-4.1.4-0.2.el7.noarch

Host:
VDSM :4.19.25-1
New libvirt: libvirt-3.2.0-14.el7_4.2.x86_64

Comment 72 Nir Soffer 2017-08-02 13:16:43 UTC
Changing the target per Tal's request.

Comment 73 Nir Soffer 2017-08-03 17:14:41 UTC
Reverting the milestone per Allon's request, changing the target instead.

Comment 74 Nir Soffer 2017-08-09 00:12:10 UTC
I backported the logging improvements added while debugging this issue, so we can
have better logging in 4.1.